Home Credit Default Risk

Part 1 | Exploratory Data Analysis | IS 6812

Author

Adam Bushman (u6049169)

Published

September 10, 2024

Introduction

The Project

Business Problem

Home Credit cares deeply about the population without sufficient credit history and aims to improve inclusion for this underrepresented group by designing prediction models for loan repayment, such that capable borrowers are not denied solely due to absence of credit while protecting against defaults.

Benefit of a Solution

A more accurate prediction model will help Home Credit a) provide lending terms and b) sufficiently balance risk such that financial inclusion is expanded.

Success Metrics

The success of the project will be measured in accurately predicting loan defaults, so that an increase in loan approval rates among this underrepresented group is accompanied by a stable or even reduction in default rates.

Analytics Approach

A supervised machine learning classification approach will be used to predict loan default risk, leveraging alternative data sources such as telco and transactional information, with the target variable being loan repayment.

Scope

The project will deliver a predictive model for assessing loan repayment abilities.

Details

The project will be executed by a team of business analysts, equipped with data processing and modeling skills. The project will feature iterative milestones for data exploration, model development, and evaluation, culminating in a final model by calendar year end.

My Approach

For this exploratory data analysis, I plan to use the following languages, APIs, and tools:

  • Quarto: probably “the” notebook of choice for the scientific community; built and maintained by Posit, it’s flexible, powerful, and beautiful.
  • Python: I usually default to R for analytical tasks so I could use the extra practice in Python.
  • DuckDB: an in-process analytical database engine. It is extremely fast and features a convenient syntax. It’s gathered tremendous steam in the industry, even for analysis tasks.

Within the Python ecosystem I’ll use several libraries that will augment the analytical process. These include, but are not limited to:

  • pandas: probably the foundational package for data analysis in Python
  • scikit-learn: a comprehensive API for data science models and workflows
  • statsmodels: one of the best statistics libraries in Python
  • plotnine: the {ggplot2} equivalent in Python, bringing “grammar of graphics”
  • skimpy: a neat package that started in R designed to summarize tabular data in a digestible way

Analysis Preparation

Data Source

The data is sourced from Kaggle where Home Credit hosted a competitiion back in the summer of 2018. Data was downloaded therefrom on August 20th, 2024 and used concurrently throughout the Capstone course at University of Utah, Fall 2024.

Loading Data

We’ll start off importing duckdb. We’ll do much of the data processing work using this powerful SQL engine.

import duckdb

With access to the API, we can begin to query our data located in files. We need only setup “relations” (informing duckdb where the files are located).

# DuckDB relations
# These are tables against which we can query using the DuckDB API

duckdb.read_csv("data/application_test.csv")
duckdb.read_csv("data/application_train.csv")
duckdb.read_csv("data/bureau_balance.csv")
duckdb.read_csv("data/bureau.csv")
duckdb.read_csv("data/credit_card_balance.csv")
duckdb.read_csv("data/installments_payments.csv")
duckdb.read_csv("data/POS_CASH_balance.csv")
duckdb.read_csv("data/previous_application.csv")

This approach isn’t what you’d call “conventional”. However, DuckDB is uniquely suited for this type of workload. We have multiple files with millions of records each. DuckDb is more performant over other Python ecosystem libraries and packages in the R ecosystem.

Okay, so it’s well suited to the problem, but how are we expecting to work with these data? We’ll walk through some use cases quickly to demonstrate how seamless it will actually be.

Working With the Data

Straight away, we can interact with these files with plain SQL. For example, with a simple query (below) we can look at the first 10 rows of the bureau.csv file.

duckdb.sql("SELECT * FROM 'data/bureau.csv' LIMIT 10").show()
┌────────────┬──────────────┬───────────────┬─────────────────┬─────────────┬────────────────────┬─────────────────────┬───────────────────┬────────────────────────┬────────────────────┬────────────────┬─────────────────────┬──────────────────────┬────────────────────────┬─────────────────┬────────────────────┬─────────────┐
│ SK_ID_CURR │ SK_ID_BUREAU │ CREDIT_ACTIVE │ CREDIT_CURRENCY │ DAYS_CREDIT │ CREDIT_DAY_OVERDUE │ DAYS_CREDIT_ENDDATE │ DAYS_ENDDATE_FACT │ AMT_CREDIT_MAX_OVERDUE │ CNT_CREDIT_PROLONG │ AMT_CREDIT_SUM │ AMT_CREDIT_SUM_DEBT │ AMT_CREDIT_SUM_LIMIT │ AMT_CREDIT_SUM_OVERDUE │   CREDIT_TYPE   │ DAYS_CREDIT_UPDATE │ AMT_ANNUITY │
│   int64    │    int64     │    varchar    │     varchar     │    int64    │       int64        │       double        │      double       │         double         │       int64        │     double     │       double        │        double        │         double         │     varchar     │       int64        │   double    │
├────────────┼──────────────┼───────────────┼─────────────────┼─────────────┼────────────────────┼─────────────────────┼───────────────────┼────────────────────────┼────────────────────┼────────────────┼─────────────────────┼──────────────────────┼────────────────────────┼─────────────────┼────────────────────┼─────────────┤
│     215354 │      5714462 │ Closed        │ currency 1      │        -497 │                  0 │              -153.0 │            -153.0 │                   NULL │                  0 │        91323.0 │                 0.0 │                 NULL │                    0.0 │ Consumer credit │               -131 │        NULL │
│     215354 │      5714463 │ Active        │ currency 1      │        -208 │                  0 │              1075.0 │              NULL │                   NULL │                  0 │       225000.0 │            171342.0 │                 NULL │                    0.0 │ Credit card     │                -20 │        NULL │
│     215354 │      5714464 │ Active        │ currency 1      │        -203 │                  0 │               528.0 │              NULL │                   NULL │                  0 │       464323.5 │                NULL │                 NULL │                    0.0 │ Consumer credit │                -16 │        NULL │
│     215354 │      5714465 │ Active        │ currency 1      │        -203 │                  0 │                NULL │              NULL │                   NULL │                  0 │        90000.0 │                NULL │                 NULL │                    0.0 │ Credit card     │                -16 │        NULL │
│     215354 │      5714466 │ Active        │ currency 1      │        -629 │                  0 │              1197.0 │              NULL │                77674.5 │                  0 │      2700000.0 │                NULL │                 NULL │                    0.0 │ Consumer credit │                -21 │        NULL │
│     215354 │      5714467 │ Active        │ currency 1      │        -273 │                  0 │             27460.0 │              NULL │                    0.0 │                  0 │       180000.0 │            71017.38 │            108982.62 │                    0.0 │ Credit card     │                -31 │        NULL │
│     215354 │      5714468 │ Active        │ currency 1      │         -43 │                  0 │                79.0 │              NULL │                    0.0 │                  0 │        42103.8 │             42103.8 │                  0.0 │                    0.0 │ Consumer credit │                -22 │        NULL │
│     162297 │      5714469 │ Closed        │ currency 1      │       -1896 │                  0 │             -1684.0 │           -1710.0 │                14985.0 │                  0 │       76878.45 │                 0.0 │                  0.0 │                    0.0 │ Consumer credit │              -1710 │        NULL │
│     162297 │      5714470 │ Closed        │ currency 1      │       -1146 │                  0 │              -811.0 │            -840.0 │                    0.0 │                  0 │       103007.7 │                 0.0 │                  0.0 │                    0.0 │ Consumer credit │               -840 │        NULL │
│     162297 │      5714471 │ Active        │ currency 1      │       -1146 │                  0 │              -484.0 │              NULL │                    0.0 │                  0 │         4500.0 │                 0.0 │                  0.0 │                    0.0 │ Credit card     │               -690 │        NULL │
├────────────┴──────────────┴───────────────┴─────────────────┴─────────────┴────────────────────┴─────────────────────┴───────────────────┴────────────────────────┴────────────────────┴────────────────┴─────────────────────┴──────────────────────┴────────────────────────┴─────────────────┴────────────────────┴─────────────┤
│ 10 rows                                                                                                                                                                                                                                                                                                                 17 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

DuckDB does a nice job styling the output and including helpful details, such as data types.

But don’t think we’re “stuck” in the world of SQL for the duration of this analysis; far from it. DuckDB is very adept at refactoring the data for use with the other packages we’ll be using. Let’s bring in pandas and see how this works:

import pandas as pd
df = duckdb.sql("SELECT * FROM 'data/bureau.csv' LIMIT 10").fetchdf()

type(df)
pandas.core.frame.DataFrame

As you can see, we now have a pandas data frame. We could run a model with scikit-learn, generate a visualization with plotnine, or perform complex, custom logic ill-fit for SQL. For now, we’ll just select every other column:

df_sub = df.loc[:,::2]

df_sub.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   SK_ID_CURR              10 non-null     int64  
 1   CREDIT_ACTIVE           10 non-null     object 
 2   DAYS_CREDIT             10 non-null     int64  
 3   DAYS_CREDIT_ENDDATE     9 non-null      float64
 4   AMT_CREDIT_MAX_OVERDUE  6 non-null      float64
 5   AMT_CREDIT_SUM          10 non-null     float64
 6   AMT_CREDIT_SUM_LIMIT    5 non-null      float64
 7   CREDIT_TYPE             10 non-null     object 
 8   AMT_ANNUITY             0 non-null      float64
dtypes: float64(5), int64(2), object(2)
memory usage: 852.0+ bytes

At this point, we may be ready to leverage the speed and efficiency of DuckDB. So we can just switch right back!

duckdb.sql("SELECT * FROM df_sub").show()
┌────────────┬───────────────┬─────────────┬─────────────────────┬────────────────────────┬────────────────┬──────────────────────┬─────────────────┬─────────────┐
│ SK_ID_CURR │ CREDIT_ACTIVE │ DAYS_CREDIT │ DAYS_CREDIT_ENDDATE │ AMT_CREDIT_MAX_OVERDUE │ AMT_CREDIT_SUM │ AMT_CREDIT_SUM_LIMIT │   CREDIT_TYPE   │ AMT_ANNUITY │
│   int64    │    varchar    │    int64    │       double        │         double         │     double     │        double        │     varchar     │   double    │
├────────────┼───────────────┼─────────────┼─────────────────────┼────────────────────────┼────────────────┼──────────────────────┼─────────────────┼─────────────┤
│     215354 │ Closed        │        -497 │              -153.0 │                   NULL │        91323.0 │                 NULL │ Consumer credit │        NULL │
│     215354 │ Active        │        -208 │              1075.0 │                   NULL │       225000.0 │                 NULL │ Credit card     │        NULL │
│     215354 │ Active        │        -203 │               528.0 │                   NULL │       464323.5 │                 NULL │ Consumer credit │        NULL │
│     215354 │ Active        │        -203 │                NULL │                   NULL │        90000.0 │                 NULL │ Credit card     │        NULL │
│     215354 │ Active        │        -629 │              1197.0 │                77674.5 │      2700000.0 │                 NULL │ Consumer credit │        NULL │
│     215354 │ Active        │        -273 │             27460.0 │                    0.0 │       180000.0 │            108982.62 │ Credit card     │        NULL │
│     215354 │ Active        │         -43 │                79.0 │                    0.0 │        42103.8 │                  0.0 │ Consumer credit │        NULL │
│     162297 │ Closed        │       -1896 │             -1684.0 │                14985.0 │       76878.45 │                  0.0 │ Consumer credit │        NULL │
│     162297 │ Closed        │       -1146 │              -811.0 │                    0.0 │       103007.7 │                  0.0 │ Consumer credit │        NULL │
│     162297 │ Active        │       -1146 │              -484.0 │                    0.0 │         4500.0 │                  0.0 │ Credit card     │        NULL │
├────────────┴───────────────┴─────────────┴─────────────────────┴────────────────────────┴────────────────┴──────────────────────┴─────────────────┴─────────────┤
│ 10 rows                                                                                                                                               9 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

There really won’t be any issue navigating between the different APIs. In fact, we’ll be able to leverage the unique strengths of each of them to best advance our analysis.

Exploration

Skim each dataset

We’ll start off by getting familiar with each of the data sets. We’ll use the {skimpy} package for this and do some visualizations. We’ll rely on the ERD to help with interpretation:

application_train.csv

Main tables - out train and test samples; target (binary); info about loan and loan applicant at application time.

This file is the same as application_test.csv except that it features the target variable.

from skimpy import skim

app_train_df = (
    duckdb                                                 # Use DuckDB's query engine
    .sql("SELECT * FROM 'data/application_train.csv'")     # Read the file
    .fetchdf()                                             # Convert to a pandas dataframe
)

skim(app_train_df)                                         # "Skim" the data set
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 307511 │ │ float64     │ 65    │                                                          │
│ │ Number of columns │ 122    │ │ int32       │ 41    │                                                          │
│ └───────────────────┴────────┘ │ string      │ 16    │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_n                                                                                         ┃  │
│ ┃ ame       NA      NA %   mean      sd       p0        p25      p50      p75      p100     hist   ┃  │
│ ┡━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ SK_ID_CU     0    0  278200 102800  100000 189100 278200 367100 456300▇▇▇▇▇▇ │  │
│ │ RR       │        │       │          │         │          │         │         │         │         │        │  │
│ │ TARGET       0    0 0.08073 0.2724       0      0      0      0      1▇    ▁ │  │
│ │ CNT_CHIL     0    0  0.4171 0.7221       0      0      0      1     19 │  │
│ │ DREN     │        │       │          │         │          │         │         │         │         │        │  │
│ │ AMT_INCO     0    0  168800 237100   25650 112500 147200 2025001170000 │  │
│ │ ME_TOTAL │        │       │          │         │          │         │         │         │      00 │        │  │
│ │ AMT_CRED     0    0  599000 402500   45000 270000 513500 8086004050000  ▇▃   │  │
│ │ IT       │        │       │          │         │          │         │         │         │         │        │  │
│ │ AMT_ANNU    12    0   27110  14490    1616  16520  24900  34600 258000  ▇▁   │  │
│ │ ITY      │        │       │          │         │          │         │         │         │         │        │  │
│ │ AMT_GOOD   278 0.09  538400 369400   40500 238500 450000 6795004050000  ▇▂   │  │
│ │ S_PRICE  │        │       │          │         │          │         │         │         │         │        │  │
│ │ REGION_P     0    0 0.020870.01383 0.000290.010010.018850.028660.07251▇▇▇▁ ▁ │  │
│ │ OPULATIO │        │       │          │         │          │         │         │         │         │        │  │
│ │ N_RELATI │        │       │          │         │          │         │         │         │         │        │  │
│ │ VE       │        │       │          │         │          │         │         │         │         │        │  │
│ │ DAYS_BIR     0    0  -16040   4364  -25230 -19680 -15750 -12410  -7489▃▆▆▇▇▃ │  │
│ │ TH       │        │       │          │         │          │         │         │         │         │        │  │
│ │ DAYS_EMP     0    0   63820 141300  -17910  -2760  -1213   -289 365200▇    ▂ │  │
│ │ LOYED    │        │       │          │         │          │         │         │         │         │        │  │
│ │ DAYS_REG     0    0   -4986   3523  -24670  -7480  -4504  -2010      0  ▁▃▆▇ │  │
│ │ ISTRATIO │        │       │          │         │          │         │         │         │         │        │  │
│ │ N        │        │       │          │         │          │         │         │         │         │        │  │
│ │ DAYS_ID_     0    0   -2994   1509   -7197  -4299  -3254  -1720      0 ▂▇▅▅▃ │  │
│ │ PUBLISH  │        │       │          │         │          │         │         │         │         │        │  │
│ │ OWN_CAR_20292965.99   12.06  11.94       0      5      9     15     91  ▇▂   │  │
│ │ AGE      │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_MOB     0    0       10.00180       0      1      1      1      1 │  │
│ │ IL       │        │       │          │       3 │          │         │         │         │         │        │  │
│ │ FLAG_EMP     0    0  0.8199 0.3843       0      1      1      1      1▂    ▇ │  │
│ │ _PHONE   │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_WOR     0    0  0.1994 0.3995       0      0      0      0      1▇    ▂ │  │
│ │ K_PHONE  │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_CON     0    0  0.99810.04316       0      1      1      1      1 │  │
│ │ T_MOBILE │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_PHO     0    0  0.2811 0.4495       0      0      0      1      1▇    ▃ │  │
│ │ NE       │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_EMA     0    0 0.05672 0.2313       0      0      0      0      1 │  │
│ │ IL       │        │       │          │         │          │         │         │         │         │        │  │
│ │ CNT_FAM_     2    0   2.153 0.9107       1      2      2      3     20 │  │
│ │ MEMBERS  │        │       │          │         │          │         │         │         │         │        │  │
│ │ REGION_R     0    0   2.052  0.509       1      2      2      2      3▁  ▇ ▂ │  │
│ │ ATING_CL │        │       │          │         │          │         │         │         │         │        │  │
│ │ IENT     │        │       │          │         │          │         │         │         │         │        │  │
│ │ REGION_R     0    0   2.032 0.5027       1      2      2      2      3▁  ▇ ▂ │  │
│ │ ATING_CL │        │       │          │         │          │         │         │         │         │        │  │
│ │ IENT_W_C │        │       │          │         │          │         │         │         │         │        │  │
│ │ ITY      │        │       │          │         │          │         │         │         │         │        │  │
│ │ HOUR_APP     0    0   12.06  3.266       0     10     12     14     23 ▁▇▇▃  │  │
│ │ R_PROCES │        │       │          │         │          │         │         │         │         │        │  │
│ │ S_START  │        │       │          │         │          │         │         │         │         │        │  │
│ │ REG_REGI     0    0 0.01514 0.1221       0      0      0      0      1 │  │
│ │ ON_NOT_L │        │       │          │         │          │         │         │         │         │        │  │
│ │ IVE_REGI │        │       │          │         │          │         │         │         │         │        │  │
│ │ ON       │        │       │          │         │          │         │         │         │         │        │  │
│ │ REG_REGI     0    0 0.05077 0.2195       0      0      0      0      1 │  │
│ │ ON_NOT_W │        │       │          │         │          │         │         │         │         │        │  │
│ │ ORK_REGI │        │       │          │         │          │         │         │         │         │        │  │
│ │ ON       │        │       │          │         │          │         │         │         │         │        │  │
│ │ LIVE_REG     0    0 0.04066 0.1975       0      0      0      0      1 │  │
│ │ ION_NOT_ │        │       │          │         │          │         │         │         │         │        │  │
│ │ WORK_REG │        │       │          │         │          │         │         │         │         │        │  │
│ │ ION      │        │       │          │         │          │         │         │         │         │        │  │
│ │ REG_CITY     0    0 0.07817 0.2684       0      0      0      0      1▇    ▁ │  │
│ │ _NOT_LIV │        │       │          │         │          │         │         │         │         │        │  │
│ │ E_CITY   │        │       │          │         │          │         │         │         │         │        │  │
│ │ REG_CITY     0    0  0.2305 0.4211       0      0      0      0      1▇    ▂ │  │
│ │ _NOT_WOR │        │       │          │         │          │         │         │         │         │        │  │
│ │ K_CITY   │        │       │          │         │          │         │         │         │         │        │  │
│ │ LIVE_CIT     0    0  0.1796 0.3838       0      0      0      0      1▇    ▂ │  │
│ │ Y_NOT_WO │        │       │          │         │          │         │         │         │         │        │  │
│ │ RK_CITY  │        │       │          │         │          │         │         │         │         │        │  │
│ │ EXT_SOUR17337856.38  0.5021 0.2111 0.01457  0.334  0.506 0.6751 0.9627▂▆▇▇▇▃ │  │
│ │ CE_1     │        │       │          │         │          │         │         │         │         │        │  │
│ │ EXT_SOUR   660 0.21  0.5144 0.19118.174e-0 0.3925  0.566 0.6636  0.855▁▂▃▅▇▃ │  │
│ │ CE_2     │        │       │          │         │        8 │         │         │         │         │        │  │
│ │ EXT_SOUR 6096519.83  0.5109 0.19480.000527 0.3706 0.5353 0.6691  0.896▁▃▅▇▇▃ │  │
│ │ CE_3     │        │       │          │         │        3 │         │         │         │         │        │  │
│ │ APARTMEN15606150.75  0.1174 0.1082       0 0.0577 0.0876 0.1485      1  ▇▂   │  │
│ │ TS_AVG   │        │       │          │         │          │         │         │         │         │        │  │
│ │ BASEMENT17994358.52 0.088440.08244       0 0.0442 0.0763 0.1122      1  ▇▁   │  │
│ │ AREA_AVG │        │       │          │         │          │         │         │         │         │        │  │
│ │ YEARS_BE15000748.78  0.97770.05922       0 0.9767 0.9816 0.9866      1 │  │
│ │ GINEXPLU │        │       │          │         │          │         │         │         │         │        │  │
│ │ ATATION_ │        │       │          │         │          │         │         │         │         │        │  │
│ │ AVG      │        │       │          │         │          │         │         │         │         │        │  │
│ │ YEARS_BU204488 66.5  0.7525 0.1133       0 0.6872 0.7552 0.8232      1   ▂▇▃ │  │
│ │ ILD_AVG  │        │       │          │         │          │         │         │         │         │        │  │
│ │ COMMONAR21486569.87 0.044620.07604       0 0.0078 0.0211 0.0515      1 │  │
│ │ EA_AVG   │        │       │          │         │          │         │         │         │         │        │  │
│ │ ELEVATOR163891 53.3 0.07894 0.1346       0      0      0   0.12      1  ▇▁   │  │
│ │ S_AVG    │        │       │          │         │          │         │         │         │         │        │  │
│ │ ENTRANCE15482850.35  0.1497    0.1       0  0.069 0.1379 0.2069      1  ▇▃   │  │
│ │ S_AVG    │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLOORSMA15302049.76  0.2263 0.1446       0 0.1667 0.1667 0.3333      1 ▃▇▁   │  │
│ │ X_AVG    │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLOORSMI20864267.85  0.2319 0.1614       0 0.0833 0.2083  0.375      1▆▇▅▁▁  │  │
│ │ N_AVG    │        │       │          │         │          │         │         │         │         │        │  │
│ │ LANDAREA18259059.38 0.066330.08118       0 0.0187 0.0481 0.0856      1 │  │
│ │ _AVG     │        │       │          │         │          │         │         │         │         │        │  │
│ │ LIVINGAP21019968.35  0.10080.09258       0 0.0504 0.0756  0.121      1  ▇▁   │  │
│ │ ARTMENTS │        │       │          │         │          │         │         │         │         │        │  │
│ │ _AVG     │        │       │          │         │          │         │         │         │         │        │  │
│ │ LIVINGAR15435050.19  0.1074 0.1106       0 0.0453 0.0745 0.1299      1  ▇▁   │  │
│ │ EA_AVG   │        │       │          │         │          │         │         │         │         │        │  │
│ │ NONLIVIN21351469.430.0088090.04773       0      0      0 0.0039      1 │  │
│ │ GAPARTME │        │       │          │         │          │         │         │         │         │        │  │
│ │ NTS_AVG  │        │       │          │         │          │         │         │         │         │        │  │
│ │ NONLIVIN16968255.18 0.028360.06952       0      0 0.0036 0.0277      1 │  │
│ │ GAREA_AV │        │       │          │         │          │         │         │         │         │        │  │
│ │ G        │        │       │          │         │          │         │         │         │         │        │  │
│ │ APARTMEN15606150.75  0.1142 0.1079       0 0.0525  0.084 0.1439      1  ▇▁   │  │
│ │ TS_MODE  │        │       │          │         │          │         │         │         │         │        │  │
│ │ BASEMENT17994358.52 0.087540.08431       0 0.0407 0.0746 0.1124      1  ▇▁   │  │
│ │ AREA_MOD │        │       │          │         │          │         │         │         │         │        │  │
│ │ E        │        │       │          │         │          │         │         │         │         │        │  │
│ │ YEARS_BE15000748.78  0.97710.06458       0 0.9767 0.9816 0.9866      1 │  │
│ │ GINEXPLU │        │       │          │         │          │         │         │         │         │        │  │
│ │ ATATION_ │        │       │          │         │          │         │         │         │         │        │  │
│ │ MODE     │        │       │          │         │          │         │         │         │         │        │  │
│ │ YEARS_BU204488 66.5  0.7596 0.1101       0 0.6994 0.7648 0.8236      1   ▂▇▃ │  │
│ │ ILD_MODE │        │       │          │         │          │         │         │         │         │        │  │
│ │ COMMONAR21486569.87 0.042550.07444       0 0.0072  0.019  0.049      1 │  │
│ │ EA_MODE  │        │       │          │         │          │         │         │         │         │        │  │
│ │ ELEVATOR163891 53.3 0.07449 0.1323       0      0      0 0.1208      1  ▇▁   │  │
│ │ S_MODE   │        │       │          │         │          │         │         │         │         │        │  │
│ │ ENTRANCE15482850.35  0.1452  0.101       0  0.069 0.1379 0.2069      1  ▇▃   │  │
│ │ S_MODE   │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLOORSMA15302049.76  0.2223 0.1437       0 0.1667 0.1667 0.3333      1 ▃▇▁   │  │
│ │ X_MODE   │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLOORSMI20864267.85  0.2281 0.1612       0 0.0833 0.2083  0.375      1▆▇▅▁▁  │  │
│ │ N_MODE   │        │       │          │         │          │         │         │         │         │        │  │
│ │ LANDAREA18259059.38 0.064960.08175       0 0.0166 0.0458 0.0841      1 │  │
│ │ _MODE    │        │       │          │         │          │         │         │         │         │        │  │
│ │ LIVINGAP21019968.35  0.10560.09788       0 0.0542 0.0771 0.1313      1  ▇▁   │  │
│ │ ARTMENTS │        │       │          │         │          │         │         │         │         │        │  │
│ │ _MODE    │        │       │          │         │          │         │         │         │         │        │  │
│ │ LIVINGAR15435050.19   0.106 0.1118       0 0.0427 0.0731 0.1252      1  ▇▁   │  │
│ │ EA_MODE  │        │       │          │         │          │         │         │         │         │        │  │
│ │ NONLIVIN21351469.430.0080760.04628       0      0      0 0.0039      1 │  │
│ │ GAPARTME │        │       │          │         │          │         │         │         │         │        │  │
│ │ NTS_MODE │        │       │          │         │          │         │         │         │         │        │  │
│ │ NONLIVIN16968255.18 0.027020.07025       0      0 0.0011 0.0231      1 │  │
│ │ GAREA_MO │        │       │          │         │          │         │         │         │         │        │  │
│ │ DE       │        │       │          │         │          │         │         │         │         │        │  │
│ │ APARTMEN15606150.75  0.1178 0.1091       0 0.0583 0.0864 0.1489      1  ▇▁   │  │
│ │ TS_MEDI  │        │       │          │         │          │         │         │         │         │        │  │
│ │ BASEMENT17994358.52 0.087950.08218       0 0.0437 0.0758 0.1116      1  ▇▁   │  │
│ │ AREA_MED │        │       │          │         │          │         │         │         │         │        │  │
│ │ I        │        │       │          │         │          │         │         │         │         │        │  │
│ │ YEARS_BE15000748.78  0.9778 0.0599       0 0.9767 0.9816 0.9866      1 │  │
│ │ GINEXPLU │        │       │          │         │          │         │         │         │         │        │  │
│ │ ATATION_ │        │       │          │         │          │         │         │         │         │        │  │
│ │ MEDI     │        │       │          │         │          │         │         │         │         │        │  │
│ │ YEARS_BU204488 66.5  0.7557 0.1121       0 0.6914 0.7585 0.8256      1   ▂▇▃ │  │
│ │ ILD_MEDI │        │       │          │         │          │         │         │         │         │        │  │
│ │ COMMONAR21486569.87  0.04460.07614       0 0.0079 0.0208 0.0513      1 │  │
│ │ EA_MEDI  │        │       │          │         │          │         │         │         │         │        │  │
│ │ ELEVATOR163891 53.3 0.07808 0.1345       0      0      0   0.12      1  ▇▁   │  │
│ │ S_MEDI   │        │       │          │         │          │         │         │         │         │        │  │
│ │ ENTRANCE15482850.35  0.1492 0.1004       0  0.069 0.1379 0.2069      1  ▇▃   │  │
│ │ S_MEDI   │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLOORSMA15302049.76  0.2259 0.1451       0 0.1667 0.1667 0.3333      1 ▃▇▁   │  │
│ │ X_MEDI   │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLOORSMI20864267.85  0.2316 0.1619       0 0.0833 0.2083  0.375      1▆▇▅▁▁  │  │
│ │ N_MEDI   │        │       │          │         │          │         │         │         │         │        │  │
│ │ LANDAREA18259059.38 0.067170.08217       0 0.0187 0.0487 0.0868      1 │  │
│ │ _MEDI    │        │       │          │         │          │         │         │         │         │        │  │
│ │ LIVINGAP21019968.35   0.1020.09364       0 0.0513 0.0761 0.1231      1  ▇▁   │  │
│ │ ARTMENTS │        │       │          │         │          │         │         │         │         │        │  │
│ │ _MEDI    │        │       │          │         │          │         │         │         │         │        │  │
│ │ LIVINGAR15435050.19  0.1086 0.1123       0 0.0457 0.0749 0.1303      1  ▇▁   │  │
│ │ EA_MEDI  │        │       │          │         │          │         │         │         │         │        │  │
│ │ NONLIVIN21351469.430.0086510.04741       0      0      0 0.0039      1 │  │
│ │ GAPARTME │        │       │          │         │          │         │         │         │         │        │  │
│ │ NTS_MEDI │        │       │          │         │          │         │         │         │         │        │  │
│ │ NONLIVIN16968255.18 0.028240.07017       0      0 0.0031 0.0266      1 │  │
│ │ GAREA_ME │        │       │          │         │          │         │         │         │         │        │  │
│ │ DI       │        │       │          │         │          │         │         │         │         │        │  │
│ │ TOTALARE14843148.27  0.1025 0.1075       0 0.0412 0.0688 0.1276      1  ▇▁   │  │
│ │ A_MODE   │        │       │          │         │          │         │         │         │         │        │  │
│ │ OBS_30_C  1021 0.33   1.422  2.401       0      0      0      2    348 │  │
│ │ NT_SOCIA │        │       │          │         │          │         │         │         │         │        │  │
│ │ L_CIRCLE │        │       │          │         │          │         │         │         │         │        │  │
│ │ DEF_30_C  1021 0.33  0.1434 0.4467       0      0      0      0     34 │  │
│ │ NT_SOCIA │        │       │          │         │          │         │         │         │         │        │  │
│ │ L_CIRCLE │        │       │          │         │          │         │         │         │         │        │  │
│ │ OBS_60_C  1021 0.33   1.405   2.38       0      0      0      2    344 │  │
│ │ NT_SOCIA │        │       │          │         │          │         │         │         │         │        │  │
│ │ L_CIRCLE │        │       │          │         │          │         │         │         │         │        │  │
│ │ DEF_60_C  1021 0.33     0.1 0.3623       0      0      0      0     24 │  │
│ │ NT_SOCIA │        │       │          │         │          │         │         │         │         │        │  │
│ │ L_CIRCLE │        │       │          │         │          │         │         │         │         │        │  │
│ │ DAYS_LAS     1    0  -962.9  826.8   -4292  -1570   -757   -274      0  ▁▃▃▇ │  │
│ │ T_PHONE_ │        │       │          │         │          │         │         │         │         │        │  │
│ │ CHANGE   │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    04.227e-00.00650       0      0      0      0      1 │  │
│ │ UMENT_2  │        │       │        5      2 │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    0    0.71 0.4538       0      0      1      1      1▃    ▇ │  │
│ │ UMENT_3  │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    08.13e-050.00901       0      0      0      0      1 │  │
│ │ UMENT_4  │        │       │          │       6 │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    0 0.01511  0.122       0      0      0      0      1 │  │
│ │ UMENT_5  │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    0 0.08806 0.2834       0      0      0      0      1▇    ▁ │  │
│ │ UMENT_6  │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    00.0001910.01385       0      0      0      0      1 │  │
│ │ UMENT_7  │        │       │        9 │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    0 0.08138 0.2734       0      0      0      0      1▇    ▁ │  │
│ │ UMENT_8  │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    00.0038960.06229       0      0      0      0      1 │  │
│ │ UMENT_9  │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    02.276e-00.00477       0      0      0      0      1 │  │
│ │ UMENT_10 │        │       │        5      1 │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    00.0039120.06242       0      0      0      0      1 │  │
│ │ UMENT_11 │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    06.504e-00.00255       0      0      0      0      1 │  │
│ │ UMENT_12 │        │       │        6 │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    00.0035250.05927       0      0      0      0      1 │  │
│ │ UMENT_13 │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    00.0029360.05411       0      0      0      0      1 │  │
│ │ UMENT_14 │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    0 0.001210.03476       0      0      0      0      1 │  │
│ │ UMENT_15 │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    00.0099280.09914       0      0      0      0      1 │  │
│ │ UMENT_16 │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    00.0002660.01633       0      0      0      0      1 │  │
│ │ UMENT_17 │        │       │        7 │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    0 0.00813 0.0898       0      0      0      0      1 │  │
│ │ UMENT_18 │        │       │          │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    00.0005950.02439       0      0      0      0      1 │  │
│ │ UMENT_19 │        │       │        1 │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    00.0005070.02252       0      0      0      0      1 │  │
│ │ UMENT_20 │        │       │        3 │         │          │         │         │         │         │        │  │
│ │ FLAG_DOC     0    00.000334 0.0183       0      0      0      0      1 │  │
│ │ UMENT_21 │        │       │        9 │         │          │         │         │         │         │        │  │
│ │ AMT_REQ_ 41519 13.50.0064020.08385       0      0      0      0      4 │  │
│ │ CREDIT_B │        │       │          │         │          │         │         │         │         │        │  │
│ │ UREAU_HO │        │       │          │         │          │         │         │         │         │        │  │
│ │ UR       │        │       │          │         │          │         │         │         │         │        │  │
│ │ AMT_REQ_ 41519 13.5   0.007 0.1108       0      0      0      0      9 │  │
│ │ CREDIT_B │        │       │          │         │          │         │         │         │         │        │  │
│ │ UREAU_DA │        │       │          │         │          │         │         │         │         │        │  │
│ │ Y        │        │       │          │         │          │         │         │         │         │        │  │
│ │ AMT_REQ_ 41519 13.5 0.03436 0.2047       0      0      0      0      8 │  │
│ │ CREDIT_B │        │       │          │         │          │         │         │         │         │        │  │
│ │ UREAU_WE │        │       │          │         │          │         │         │         │         │        │  │
│ │ EK       │        │       │          │         │          │         │         │         │         │        │  │
│ │ AMT_REQ_ 41519 13.5  0.2674  0.916       0      0      0      0     27 │  │
│ │ CREDIT_B │        │       │          │         │          │         │         │         │         │        │  │
│ │ UREAU_MO │        │       │          │         │          │         │         │         │         │        │  │
│ │ N        │        │       │          │         │          │         │         │         │         │        │  │
│ │ AMT_REQ_ 41519 13.5  0.2655 0.7941       0      0      0      0    261 │  │
│ │ CREDIT_B │        │       │          │         │          │         │         │         │         │        │  │
│ │ UREAU_QR │        │       │          │         │          │         │         │         │         │        │  │
│ │ T        │        │       │          │         │          │         │         │         │         │        │  │
│ │ AMT_REQ_ 41519 13.5     1.9  1.869       0      0      1      3     25  ▇▁   │  │
│ │ CREDIT_B │        │       │          │         │          │         │         │         │         │        │  │
│ │ UREAU_YE │        │       │          │         │          │         │         │         │         │        │  │
│ │ AR       │        │       │          │         │          │         │         │         │         │        │  │
│ └──────────┴────────┴───────┴──────────┴─────────┴──────────┴─────────┴─────────┴─────────┴─────────┴────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                         NA           NA %       words per row           total words        ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩  │
│ │ NAME_CONTRACT_TYPE                          0        0                     2            615022 │  │
│ │ CODE_GENDER                                 0        0                     1            307511 │  │
│ │ FLAG_OWN_CAR                                0        0                     1            307511 │  │
│ │ FLAG_OWN_REALTY                             0        0                     1            307511 │  │
│ │ NAME_TYPE_SUITE                          1292     0.42                     1            318131 │  │
│ │ NAME_INCOME_TYPE                            0        0                   1.3            400836 │  │
│ │ NAME_EDUCATION_TYPE                         0        0                   3.4           1051804 │  │
│ │ NAME_FAMILY_STATUS                          0        0                   1.5            473618 │  │
│ │ NAME_HOUSING_TYPE                           0        0                   2.9            887890 │  │
│ │ OCCUPATION_TYPE                         96391    31.35                   1.1            341374 │  │
│ │ WEEKDAY_APPR_PROCESS_START                  0        0                     1            307511 │  │
│ │ ORGANIZATION_TYPE                           0        0                   2.1            638609 │  │
│ │ FONDKAPREMONT_MODE                     210295    68.39                  0.97            298041 │  │
│ │ HOUSETYPE_MODE                         154297    50.18                   1.5            456931 │  │
│ │ WALLSMATERIAL_MODE                     156341    50.84                   0.7            215985 │  │
│ │ EMERGENCYSTATE_MODE                    145755     47.4                  0.53            161756 │  │
│ └────────────────────────────────────┴─────────────┴───────────┴────────────────────────┴────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

This is a LARGE data set. Let’s break down some of the observations we can glean from the summary:

Data Types

  • There’s several variables casted as numeric that should be categorical
    • Any variables with FLAG_ prefixes: ~32 columns
      • Examples: FLAG_MOBIL, FLAG_DOCUMENT_#, …
    • Other variables treated as flags but not specified in the name: ~6 columns
      • Examples: REG_REGION_NOT_LIVE_REGION, LIVE_CITY_NOT_WORK_CITY, …
    • REGION_RATING_ variables that are ordinal classificaitons
  • Many variables of type string could be candidates for categorical depending on count of unique values
    • Examples: NAME_CONTRACT_TYPE, ORGANIZATION_TYPE
  • Some variables casted as string but may need to be numeric
    • Any with _MODE suffix

Missing Data

  • EXT_SOURCE_# these are scores ingested from third parties
    • Understandably sparse; not every client would be represented in third-party systems
  • Fields including APARTMENT_, BASEMENT_, ELEVATOR_, etc., all relate to the client’s current dwelling.
    • 43 of the file’s 122 columns relate to the client’s dwelling
    • These are very sparse, about ~50% complete. These fields represent most of the missing data
    • Presumably, the lender was unable to collect such detail for every client
  • OBS_##_SOCIAL_CIRCLE and DEF_##_SOCIAL_CIRCLE relate to default rates on observation ## days past due
    • This is a count that’s heavily skewed to the right
    • <1% of observations have these fields as missing
  • AMT_REQ_CREDIT_BUREAU_XX are fields measure the number of credit inqueries in XX timeframe
    • These data set is geared around users without sufficient credit history so makes sense this is missing
    • ~13% of the data is missing
  • OCCUPATION_TYPE refers to the type of job the user has
    • Missing for virtually 1/3 clients
    • ORGANIZATION_TYPE, however, is not missing so we do retain some attribute relative to their work
    • Perhaps in combination with NAME_EDUCATION_TYPE there’s predictive power

Distribution

  • All dwelling related variables have been normalized based on description column descriptions file
    • Checking the histograms to the far right, this appears to be accurate
  • There’s a couple variables with skewed distributions
    • DAYS_REGISTRATION: skewed right
    • DAYS_LAST_PHONE_CHANGE: skewed right

Potential feature engineering

  • Dimensionality reduction
    • FLAG_DOCUMENT_# fields could potentially be summarized a few ways;
      • % of documents completed
      • Finding principal components
      • Maybe only certain documents matter
    • There’s many measures of central tendency for dwelling (avg, mode, median)
      • Likely not all for each perspective is needed

bureau.csv

Application data from previous loans that client got from other institutions that were reported to the Credit Bureau. One row per client’s loan in Credit Bureau

This file will tell us all about the previous loans any credit history client would have. Let’s digest the data:

bureau_df = (
    duckdb                                      # Use DuckDB's query engine
    .sql("SELECT * FROM 'data/bureau.csv'")     # Read the file
    .fetchdf()                                  # Convert to a pandas dataframe
)

skim(bureau_df)                                 # "Skim" the data set
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                 Data Types                                                                │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                         │
│ ┃ dataframe          Values  ┃ ┃ Column Type  Count ┃                                                         │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                         │
│ │ Number of rows    │ 1716428 │ │ float64     │ 8     │                                                         │
│ │ Number of columns │ 17      │ │ int32       │ 6     │                                                         │
│ └───────────────────┴─────────┘ │ string      │ 3     │                                                         │
│                                 └─────────────┴───────┘                                                         │
│                                                     number                                                      │
│ ┏━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_n                                                                                         ┃  │
│ ┃ ame       NA       NA %   mean     sd       p0        p25      p50      p75      p100     hist   ┃  │
│ ┡━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ SK_ID_CU      0    0 278200 102900  100000 188900 278100 367400 456300▇▇▇▇▇▇ │  │
│ │ RR       │         │       │         │         │          │         │         │         │         │        │  │
│ │ SK_ID_BU      0    05924000 532300 50000005464000592600063860006843000▇▇▇▇▇▇ │  │
│ │ REAU     │         │       │         │         │          │         │         │         │         │        │  │
│ │ DAYS_CRE      0    0  -1142  795.2   -2922  -1666   -987   -474      0▃▃▃▆▇▇ │  │
│ │ DIT      │         │       │         │         │          │         │         │         │         │        │  │
│ │ CREDIT_D      0    0 0.8182  36.54       0      0      0      0   2792 │  │
│ │ AY_OVERD │         │       │         │         │          │         │         │         │         │        │  │
│ │ UE       │         │       │         │         │          │         │         │         │         │        │  │
│ │ DAYS_CRE 105553 6.15  510.5   4994  -42060  -1138   -330    474  31200 │  │
│ │ DIT_ENDD │         │       │         │         │          │         │         │         │         │        │  │
│ │ ATE      │         │       │         │         │          │         │         │         │         │        │  │
│ │ DAYS_END 63365336.92  -1017    714  -42020  -1489   -897   -425      0 │  │
│ │ DATE_FAC │         │       │         │         │          │         │         │         │         │        │  │
│ │ T        │         │       │         │         │          │         │         │         │         │        │  │
│ │ AMT_CRED112448865.51   3825 206000       0      0      0      01160000 │  │
│ │ IT_MAX_O │         │       │         │         │          │         │         │         │      00 │        │  │
│ │ VERDUE   │         │       │         │         │          │         │         │         │         │        │  │
│ │ CNT_CRED      0    00.006410.09622       0      0      0      0      9 │  │
│ │ IT_PROLO │         │       │         │         │          │         │         │         │         │        │  │
│ │ NG       │         │       │         │         │          │         │         │         │         │        │  │
│ │ AMT_CRED     13    0 3550001150000       0  51300 125500 3150005850000 │  │
│ │ IT_SUM   │         │       │         │         │          │         │         │         │      00 │        │  │
│ │ AMT_CRED 25766915.01 137100 677400-4706000      0      0  401501701000 │  │
│ │ IT_SUM_D │         │       │         │         │          │         │         │         │      00 │        │  │
│ │ EBT      │         │       │         │         │          │         │         │         │         │        │  │
│ │ AMT_CRED 59178034.48   6230  45030 -586400      0      0      04706000 │  │
│ │ IT_SUM_L │         │       │         │         │          │         │         │         │         │        │  │
│ │ IMIT     │         │       │         │         │          │         │         │         │         │        │  │
│ │ AMT_CRED      0    0  37.91   5938       0      0      0      03757000 │  │
│ │ IT_SUM_O │         │       │         │         │          │         │         │         │         │        │  │
│ │ VERDUE   │         │       │         │         │          │         │         │         │         │        │  │
│ │ DAYS_CRE      0    0 -593.7  720.7  -41950   -908   -395    -33    372 │  │
│ │ DIT_UPDA │         │       │         │         │          │         │         │         │         │        │  │
│ │ TE       │         │       │         │         │          │         │         │         │         │        │  │
│ │ AMT_ANNU122679171.47  15710 325800       0      0      0  135001185000 │  │
│ │ ITY      │         │       │         │         │          │         │         │         │      00 │        │  │
│ └──────────┴─────────┴───────┴─────────┴─────────┴──────────┴─────────┴─────────┴─────────┴─────────┴────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                      NA      NA %        words per row               total words           ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ CREDIT_ACTIVE                       0         0                         1              1716449 │  │
│ │ CREDIT_CURRENCY                     0         0                         2              3432856 │  │
│ │ CREDIT_TYPE                         0         0                         2              3410733 │  │
│ └─────────────────────────────────┴────────┴────────────┴────────────────────────────┴───────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

The first thing to notice is the ~1.7M records, compared to the previous file of ~360K. Cearly there will be multiple historical bureau records for each matching application.

Data Types

  • All seems to be in fair order
  • There’s opportunity with the 3 string variables to cast as a categorical

Missing Data

  • 5 columns are missing data
    • DAYS_CREDIT_ENDDATE and DAYS_ENDDATE_FACT measure days between current application and end date of bureau credit; the end date must just be missing
    • AMT_CREDIT_MAX_OVERDUE is curiously missing a lot of data; perhaps due to there not having been an overdue balance to report
    • AMT_CREDIT_SUM_DEBT is understandable if someone had no current debt sums
    • AMT_CREDIT_SUM_LIMIT is understandably blank should a client not have a credit card
    • AMT_ANNUITY only those with an annuity loan would have values here

Many of those seem resonable to populate with zeros in a cleaning phase.

Distribution

  • DAYS_CREDIT seem to be the only significantly skewed variable

Potential feature engineering

  • I’d be curious just how much variability in credit reports per individual could be explained by dimensionality reduction (PCA)

bureau_balance.csv

Monthly balance of credits in Credit Bureau

Essentially, we have an expanded version of balances per client loan reported to the bureau.

bureau_bal_df = (
    duckdb                                              # Use DuckDB's query engine
    .sql("SELECT * FROM 'data/bureau_balance.csv'")     # Read the file
    .fetchdf()                                          # Convert to a pandas dataframe
)

skim(bureau_bal_df)                                     # "Skim" the data set
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│           Data Summary                 Data Types                                                               │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                        │
│ ┃ dataframe          Values   ┃ ┃ Column Type  Count ┃                                                        │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                        │
│ │ Number of rows    │ 27299925 │ │ int32       │ 2     │                                                        │
│ │ Number of columns │ 3        │ │ string      │ 1     │                                                        │
│ └───────────────────┴──────────┘ └─────────────┴───────┘                                                        │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━┳━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_name      NA   NA %  mean     sd      p0       p25      p50      p75      p100     hist   ┃  │
│ ┡━━━━━━━━━━━━━━━━━╇━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ SK_ID_BUREAU     0   0603600049230050020005731000607100064320006843000▃▃▅▇▆▆ │  │
│ │ MONTHS_BALANCE   0   0 -30.74 23.86    -96    -46    -25    -11      0▁▂▂▃▆▇ │  │
│ └─────────────────┴─────┴──────┴─────────┴────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                NA       NA %        words per row                 total words              ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ STATUS                         0         0                           1                27299925 │  │
│ └───────────────────────────┴─────────┴────────────┴──────────────────────────────┴──────────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

The first thing I’m seeing is ~27.2M records. Multiple monthly balances per loan.

Data Types

  • All seems to be in fair order
  • There’s opportunity the status variable to cast as a categorical

Missing Data

No missing data

Distribution

  • MONTHS_BLANCE seems to be skewed to the right

Potential feature engineering

  • It’s possible this dataset can be represented in 2 variables:
    • Existence of previous loan: boolean
    • Median oustanding balance: numeric

previous_application.csv

Application data of client’s previous loans in Home Credit. Info about the previous loan parameters and client info at time of previous application. One row per previous application.

This file details previous applications with Home Credit. Some clients may have never applied for loans previously while others could have had multiple applications.

prev_app_df = (
    duckdb                                                      # Use DuckDB's query engine
    .sql("SELECT * FROM 'data/previous_application.csv'")       # Read the file
    .fetchdf()                                                  # Convert to a pandas dataframe
)

skim(prev_app_df)                                               # "Skim" the data set
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                 Data Types                                                                │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                         │
│ ┃ dataframe          Values  ┃ ┃ Column Type  Count ┃                                                         │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                         │
│ │ Number of rows    │ 1670214 │ │ string      │ 16    │                                                         │
│ │ Number of columns │ 37      │ │ float64     │ 15    │                                                         │
│ └───────────────────┴─────────┘ │ int32       │ 6     │                                                         │
│                                 └─────────────┴───────┘                                                         │
│                                                     number                                                      │
│ ┏━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_n                                                                                         ┃  │
│ ┃ ame       NA       NA %   mean     sd       p0        p25      p50      p75      p100     hist   ┃  │
│ ┡━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ SK_ID_PR      0    01923000 532600 10000001462000192300023840002845000▇▇▇▇▇▇ │  │
│ │ EV       │         │       │         │         │          │         │         │         │         │        │  │
│ │ SK_ID_CU      0    0 278400 102800  100000 189300 278700 367500 456300▇▇▇▇▇▇ │  │
│ │ RR       │         │       │         │         │          │         │         │         │         │        │  │
│ │ AMT_ANNU 37223522.29  15960  14780       0   6322  11250  20660 418100 │  │
│ │ ITY      │         │       │         │         │          │         │         │         │         │        │  │
│ │ AMT_APPL      0    0 175200 292800       0  18720  71050 1804006905000 │  │
│ │ ICATION  │         │       │         │         │          │         │         │         │         │        │  │
│ │ AMT_CRED      1    0 196100 318600       0  24160  80540 2164006905000 │  │
│ │ IT       │         │       │         │         │          │         │         │         │         │        │  │
│ │ AMT_DOWN 89584453.64   6697  20920    -0.9      0   1638   77403060000 │  │
│ │ _PAYMENT │         │       │         │         │          │         │         │         │         │        │  │
│ │ AMT_GOOD 38551523.08 227800 315400       0  50840 112300 2340006905000 │  │
│ │ S_PRICE  │         │       │         │         │          │         │         │         │         │        │  │
│ │ HOUR_APP      0    0  12.48  3.334       0     10     12     15     23 ▁▇▇▃  │  │
│ │ R_PROCES │         │       │         │         │          │         │         │         │         │        │  │
│ │ S_START  │         │       │         │         │          │         │         │         │         │        │  │
│ │ NFLAG_LA      0    0 0.99650.05933       0      1      1      1      1 │  │
│ │ ST_APPL_ │         │       │         │         │          │         │         │         │         │        │  │
│ │ IN_DAY   │         │       │         │         │          │         │         │         │         │        │  │
│ │ RATE_DOW 89584453.640.07964 0.1078-1.498e-      00.05161 0.1089      1  ▇▁   │  │
│ │ N_PAYMEN │         │       │         │         │       05 │         │         │         │         │        │  │
│ │ T        │         │       │         │         │          │         │         │         │         │        │  │
│ │ RATE_INT166426399.64 0.18840.08767 0.03478 0.1607 0.1891 0.1933      1  ▇▁   │  │
│ │ EREST_PR │         │       │         │         │          │         │         │         │         │        │  │
│ │ IMARY    │         │       │         │         │          │         │         │         │         │        │  │
│ │ RATE_INT166426399.64 0.7735 0.1009  0.3732 0.7156 0.8351 0.8525      1  ▂▂▇  │  │
│ │ EREST_PR │         │       │         │         │          │         │         │         │         │        │  │
│ │ IVILEGED │         │       │         │         │          │         │         │         │         │        │  │
│ │ DAYS_DEC      0    0 -880.7  779.1   -2922  -1300   -581   -280     -1▁▁▁▂▅▇ │  │
│ │ ISION    │         │       │         │         │          │         │         │         │         │        │  │
│ │ SELLERPL      0    0    314   7127      -1     -1      3     824000000 │  │
│ │ ACE_AREA │         │       │         │         │          │         │         │         │         │        │  │
│ │ CNT_PAYM 37223022.29  16.05  14.57       0      6     12     24     84 ▇▂▁▁  │  │
│ │ ENT      │         │       │         │         │          │         │         │         │         │        │  │
│ │ DAYS_FIR 673065 40.3 342200  88920   -2922 365200 365200 365200 365200▁    ▇ │  │
│ │ ST_DRAWI │         │       │         │         │          │         │         │         │         │        │  │
│ │ NG       │         │       │         │         │          │         │         │         │         │        │  │
│ │ DAYS_FIR 673065 40.3  13830  72440   -2892  -1628   -831   -411 365200 │  │
│ │ ST_DUE   │         │       │         │         │          │         │         │         │         │        │  │
│ │ DAYS_LAS 673065 40.3  33770 106900   -2801  -1242   -361    129 365200▇    ▁ │  │
│ │ T_DUE_1S │         │       │         │         │          │         │         │         │         │        │  │
│ │ T_VERSIO │         │       │         │         │          │         │         │         │         │        │  │
│ │ N        │         │       │         │         │          │         │         │         │         │        │  │
│ │ DAYS_LAS 673065 40.3  76580 149600   -2889  -1314   -537    -74 365200▇    ▂ │  │
│ │ T_DUE    │         │       │         │         │          │         │         │         │         │        │  │
│ │ DAYS_TER 673065 40.3  81990 153300   -2874  -1270   -499    -44 365200▇    ▂ │  │
│ │ MINATION │         │       │         │         │          │         │         │         │         │        │  │
│ │ NFLAG_IN 673065 40.3 0.3326 0.4711       0      0      0      1      1▇    ▃ │  │
│ │ SURED_ON │         │       │         │         │          │         │         │         │         │        │  │
│ │ _APPROVA │         │       │         │         │          │         │         │         │         │        │  │
│ │ L        │         │       │         │         │          │         │         │         │         │        │  │
│ └──────────┴─────────┴───────┴─────────┴─────────┴──────────┴─────────┴─────────┴─────────┴─────────┴────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                         NA           NA %       words per row           total words        ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩  │
│ │ NAME_CONTRACT_TYPE                          0        0                     2           3340082 │  │
│ │ WEEKDAY_APPR_PROCESS_START                  0        0                     1           1670214 │  │
│ │ FLAG_LAST_APPL_PER_CONTRACT                 0        0                     1           1670214 │  │
│ │ NAME_CASH_LOAN_PURPOSE                      0        0                     1           1725981 │  │
│ │ NAME_CONTRACT_STATUS                        0        0                     1           1696650 │  │
│ │ NAME_PAYMENT_TYPE                           0        0                   2.9           4801959 │  │
│ │ CODE_REJECT_REASON                          0        0                     1           1670214 │  │
│ │ NAME_TYPE_SUITE                        820405    49.12                  0.55            921358 │  │
│ │ NAME_CLIENT_TYPE                            0        0                     1           1670214 │  │
│ │ NAME_GOODS_CATEGORY                         0        0                   1.2           1959050 │  │
│ │ NAME_PORTFOLIO                              0        0                     1           1670214 │  │
│ │ NAME_PRODUCT_TYPE                           0        0                     1           1670214 │  │
│ │ CHANNEL_TYPE                                0        0                   2.5           4251465 │  │
│ │ NAME_SELLER_INDUSTRY                        0        0                   1.2           2074684 │  │
│ │ NAME_YIELD_GROUP                            0        0                     1           1670214 │  │
│ │ PRODUCT_COMBINATION                       346     0.02                     3           4973611 │  │
│ └────────────────────────────────────┴─────────────┴───────────┴────────────────────────┴────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

This file has ~1.0M records. Understandably, there’s more records than applications due to monthly summaries and some customers having multiple previous loans.

Data Types

  • A few variables are mapped as continuous but need to be moved to discrete (possibly categorical)
    • SELLERPLACE_AREA, NFLAG_INSURED_ON_APPROVAL, NFLAG_MICRO_CASH, NFLAG_LAST_APPL_IN_DAY, FLAG_LAST_APPL_PER_CONTRACT
  • A few variables mapped as string may be better suited as categorical
    • NAME_CONTRACT_TYPE, NAME_CONTRACT_STATUS, NAME_PAYMENT_TYPE, NAME_CLIENT_TYPE, NAME_GOODS_CATEGORY, etc.

Missing Data

  • There are several missing data points
    • AMT_ANNUITY refers to the previous application; if there was none or its balance is zero, makes sense to be missing
    • AMT_DOWN_PAYMENT would be zero if there was none made
    • RATE_DOWN_PAYMENT how much was put down relative to the loan (see above)
    • RATE_INTEREST_PRIMARY & RATE_INTEREST_PRIVILEGED; since this is applicationss, it could be there was previous credit
      • A high rate of missing data (99%)
    • CNT_PAYMENT how far into a previous loan for the current one
    • DAYS_FIRST_* refers to the first dispersement, due amount, etc; if no loan was approved, makes sense to be blank
    • NAME_TYPE_SUITE is the only discrete field with high NAs (49%); it indicates who accompanied the client in for the application

Distribution

  • Only two right-skewed data points I can see
    • DAYS_DECISION and RATE_INTEREST_PRIVILEGED (virtually no data on this one)

Potential feature engineering

  • Ultimately this will need to be aggregated; ways to represent the data?
    • Count of previous applications
    • Status/reject rates
    • Classify goods, maybe (% reasonable)

installments_payments.csv

Past payment data for each installments of previous credits in Home Credit related to loans in our sample.

These data will link to the current application and to past applications. Some clients will have previous loans, even multiple, while others may have none. We can see payments due and made in this file.

installment_pmt_df = (
    duckdb                                                      # Use DuckDB's query engine
    .sql("SELECT * FROM 'data/installments_payments.csv'")      # Read the file
    .fetchdf()                                                  # Convert to a pandas dataframe
)

skim(installment_pmt_df)                                        # "Skim" the data set
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│           Data Summary                 Data Types                                                               │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                        │
│ ┃ dataframe          Values   ┃ ┃ Column Type  Count ┃                                                        │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                        │
│ │ Number of rows    │ 13605401 │ │ float64     │ 5     │                                                        │
│ │ Number of columns │ 8        │ │ int32       │ 3     │                                                        │
│ └───────────────────┴──────────┘ └─────────────┴───────┘                                                        │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_name     NA    NA %  mean     sd      p0       p25      p50      p75      p100     hist   ┃  │
│ ┡━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ SK_ID_PREV       0   0190300053620010000001434000189700023690002843000▇▇▇▇▇▇ │  │
│ │ SK_ID_CURR       0   0 278400102700 100000 189600 278700 367500 456300▇▇▇▇▇▇ │  │
│ │ NUM_INSTALMENT   0   0 0.8566 1.035      0      0      1      1    178 │  │
│ │ _VERSION       │      │      │         │        │         │         │         │         │         │        │  │
│ │ NUM_INSTALMENT   0   0  18.87 26.66      1      4      8     19    277  ▇▁   │  │
│ │ _NUMBER        │      │      │         │        │         │         │         │         │         │        │  │
│ │ DAYS_INSTALMEN   0   0  -1042 800.9  -2922  -1654   -818   -361     -1▂▃▃▃▅▇ │  │
│ │ T              │      │      │         │        │         │         │         │         │         │        │  │
│ │ DAYS_ENTRY_PAY29050.02  -1051 800.6  -4921  -1662   -827   -370     -1  ▁▃▃▇ │  │
│ │ MENT           │      │      │         │        │         │         │         │         │         │        │  │
│ │ AMT_INSTALMENT   0   0  17050 50570      0   4226   8884  167103771000 │  │
│ │ AMT_PAYMENT   29050.02  17240 54740      0   3398   8126  161103771000 │  │
│ └────────────────┴──────┴──────┴─────────┴────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

This file has ~1.4M records.

Data Types

  • We’ve got two ID columns that aren’t really continuous: SK_ID_PREV and SK_ID_CURR
  • All other variable data types look fine

Missing Data

  • DAYS_ENTR_PAYMENT and AMT_PAYMENT are mostly complete. The few missing values must be unpaid amounts, may due to default but perhaps due to the cutoff date

Distribution

  • DAYS_INSTALMENT seems to be skewed to the right

Potential feature engineering

  • We could probably synthesize this down to the installment payment timing

POS_CASH_balance.csv

Monthly balance of client’s previous loans in Home Credit

These data will link to the application and to past payment installments. Some clients will have previous loans, even multiple, while other may have none.

cash_bal_df = (
    duckdb                                              # Use DuckDB's query engine
    .sql("SELECT * FROM 'data/POS_CASH_balance.csv'")   # Read the file
    .fetchdf()                                          # Convert to a pandas dataframe
)

skim(cash_bal_df)                                       # "Skim" the data set
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│           Data Summary                 Data Types                                                               │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                        │
│ ┃ dataframe          Values   ┃ ┃ Column Type  Count ┃                                                        │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                        │
│ │ Number of rows    │ 10001358 │ │ int32       │ 5     │                                                        │
│ │ Number of columns │ 8        │ │ float64     │ 2     │                                                        │
│ └───────────────────┴──────────┘ │ string      │ 1     │                                                        │
│                                  └─────────────┴───────┘                                                        │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_name    NA     NA %  mean     sd      p0       p25      p50      p75      p100     hist   ┃  │
│ ┡━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ SK_ID_PREV       0   0190300053580010000001434000189700023690002843000▇▇▇▇▇▇ │  │
│ │ SK_ID_CURR       0   0 278400102800 100000 189600 278700 367400 456300▇▇▇▇▇▇ │  │
│ │ MONTHS_BALANC    0   0 -35.01 26.07    -96    -54    -28    -13     -1▂▃▃▃▆▇ │  │
│ │ E             │       │      │         │        │         │         │         │         │         │        │  │
│ │ CNT_INSTALMEN260710.26  17.09    12      1     10     12     24     92 ▇▃▁▁  │  │
│ │ T             │       │      │         │        │         │         │         │         │         │        │  │
│ │ CNT_INSTALMEN260870.26  10.48 11.11      0      3      7     14     85 ▇▂▁   │  │
│ │ T_FUTURE      │       │      │         │        │         │         │         │         │         │        │  │
│ │ SK_DPD           0   0  11.61 132.7      0      0      0      0   4231 │  │
│ │ SK_DPD_DEF       0   0 0.6545 32.76      0      0      0      0   3595 │  │
│ └───────────────┴───────┴──────┴─────────┴────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                            NA     NA %       words per row             total words         ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ NAME_CONTRACT_STATUS                     0        0                       1           10018377 │  │
│ └───────────────────────────────────────┴───────┴───────────┴──────────────────────────┴─────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

This file has ~1.0M records. Understandably, there’s more records than applications due to monthly summaries and some customers having multiple previous loans.

Data Types

  • We’ve got two ID columns that aren’t really continuous: SK_ID_PREV and SK_ID_CURR
  • There’s opportunity the NAME_CONTRACT_STATUS variable to cast as a categorical

Missing Data

  • CNT_INSTALLMENT and CNT_INSTALLMENT_FUTURE are mostly complete. Where there exist missing values, it means there’s nothing outstanding in previous loans.

Distribution

  • MONTHS_BLANCE seems to be skewed to the right

Potential feature engineering

  • It’s possible this dataset can be represented in 2 variables:
    • Existence of previous loan: boolean
    • Median oustanding balance: numeric

credit_card_balance.csv

Monthly balance of client’s previous credit card loans in Home Credit

These data are all about credit card loans with Home Credit, specifically the balance that is carried.

credit_card_df = (
    duckdb                                                  # Use DuckDB's query engine
    .sql("SELECT * FROM 'data/credit_card_balance.csv'")    # Read the file
    .fetchdf()                                              # Convert to a pandas dataframe
)

skim(credit_card_df)                                        # "Skim" the data set
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                 Data Types                                                                │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                         │
│ ┃ dataframe          Values  ┃ ┃ Column Type  Count ┃                                                         │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                         │
│ │ Number of rows    │ 3840312 │ │ float64     │ 15    │                                                         │
│ │ Number of columns │ 23      │ │ int32       │ 7     │                                                         │
│ └───────────────────┴─────────┘ │ string      │ 1     │                                                         │
│                                 └─────────────┴───────┘                                                         │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_na                                                                                        ┃  │
│ ┃ me         NA      NA %   mean      sd       p0       p25      p50      p75      p100     hist   ┃  │
│ ┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ SK_ID_PRE     0    0 1905000 53650010000001434000189700023690002843000▇▇▇▇▇▇ │  │
│ │ V         │        │       │          │         │         │         │         │         │         │        │  │
│ │ SK_ID_CUR     0    0  278300 102700 100000 189500 278400 367600 456200▇▇▇▇▇▇ │  │
│ │ R         │        │       │          │         │         │         │         │         │         │        │  │
│ │ MONTHS_BA     0    0  -34.52  26.67    -96    -55    -28    -11     -1▂▃▃▃▅▇ │  │
│ │ LANCE     │        │       │          │         │         │         │         │         │         │        │  │
│ │ AMT_BALAN     0    0   58300 106300-420300      0      0  890501506000  ▇▁   │  │
│ │ CE        │        │       │          │         │         │         │         │         │         │        │  │
│ │ AMT_CREDI     0    0  153800 165100      0  45000 112500 1800001350000 ▇▁▁   │  │
│ │ T_LIMIT_A │        │       │          │         │         │         │         │         │         │        │  │
│ │ CTUAL     │        │       │          │         │         │         │         │         │         │        │  │
│ │ AMT_DRAWI74981619.52    5961  28230  -6827      0      0      02115000 │  │
│ │ NGS_ATM_C │        │       │          │         │         │         │         │         │         │        │  │
│ │ URRENT    │        │       │          │         │         │         │         │         │         │        │  │
│ │ AMT_DRAWI     0    0    7433  33850  -6212      0      0      02287000 │  │
│ │ NGS_CURRE │        │       │          │         │         │         │         │         │         │        │  │
│ │ NT        │        │       │          │         │         │         │         │         │         │        │  │
│ │ AMT_DRAWI74981619.52   288.2   8202      0      0      0      01530000 │  │
│ │ NGS_OTHER │        │       │          │         │         │         │         │         │         │        │  │
│ │ _CURRENT  │        │       │          │         │         │         │         │         │         │        │  │
│ │ AMT_DRAWI74981619.52    2969  20800      0      0      0      02239000 │  │
│ │ NGS_POS_C │        │       │          │         │         │         │         │         │         │        │  │
│ │ URRENT    │        │       │          │         │         │         │         │         │         │        │  │
│ │ AMT_INST_305236 7.95    3540   5600      0      0      0   6634 202900 │  │
│ │ MIN_REGUL │        │       │          │         │         │         │         │         │         │        │  │
│ │ ARITY     │        │       │          │         │         │         │         │         │         │        │  │
│ │ AMT_PAYME767988   20   10280  36080      0  152.4   2703   90004289000 │  │
│ │ NT_CURREN │        │       │          │         │         │         │         │         │         │        │  │
│ │ T         │        │       │          │         │         │         │         │         │         │        │  │
│ │ AMT_PAYME     0    0    7589  32010      0      0      0   67504278000 │  │
│ │ NT_TOTAL_ │        │       │          │         │         │         │         │         │         │        │  │
│ │ CURRENT   │        │       │          │         │         │         │         │         │         │        │  │
│ │ AMT_RECEI     0    0   55970 102500-423300      0      0  853601472000  ▇▁   │  │
│ │ VABLE_PRI │        │       │          │         │         │         │         │         │         │        │  │
│ │ NCIPAL    │        │       │          │         │         │         │         │         │         │        │  │
│ │ AMT_RECIV     0    0   58090 106000-420300      0      0  889001493000  ▇▁   │  │
│ │ ABLE      │        │       │          │         │         │         │         │         │         │        │  │
│ │ AMT_TOTAL     0    0   58100 106000-420300      0      0  889101493000  ▇▁   │  │
│ │ _RECEIVAB │        │       │          │         │         │         │         │         │         │        │  │
│ │ LE        │        │       │          │         │         │         │         │         │         │        │  │
│ │ CNT_DRAWI74981619.52  0.3094    1.1      0      0      0      0     51 │  │
│ │ NGS_ATM_C │        │       │          │         │         │         │         │         │         │        │  │
│ │ URRENT    │        │       │          │         │         │         │         │         │         │        │  │
│ │ CNT_DRAWI     0    0  0.7031   3.19      0      0      0      0    165 │  │
│ │ NGS_CURRE │        │       │          │         │         │         │         │         │         │        │  │
│ │ NT        │        │       │          │         │         │         │         │         │         │        │  │
│ │ CNT_DRAWI74981619.520.0048120.08264      0      0      0      0     12 │  │
│ │ NGS_OTHER │        │       │          │         │         │         │         │         │         │        │  │
│ │ _CURRENT  │        │       │          │         │         │         │         │         │         │        │  │
│ │ CNT_DRAWI74981619.52  0.5595  3.241      0      0      0      0    165 │  │
│ │ NGS_POS_C │        │       │          │         │         │         │         │         │         │        │  │
│ │ URRENT    │        │       │          │         │         │         │         │         │         │        │  │
│ │ CNT_INSTA305236 7.95   20.83  20.05      0      4     15     32    120 ▇▃▂▁  │  │
│ │ LMENT_MAT │        │       │          │         │         │         │         │         │         │        │  │
│ │ URE_CUM   │        │       │          │         │         │         │         │         │         │        │  │
│ │ SK_DPD        0    0   9.284  97.52      0      0      0      0   3260 │  │
│ │ SK_DPD_DE     0    0  0.3316  21.48      0      0      0      0   3260 │  │
│ │ F         │        │       │          │         │         │         │         │         │         │        │  │
│ └───────────┴────────┴───────┴──────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                            NA     NA %       words per row             total words         ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ NAME_CONTRACT_STATUS                     0        0                       1            3840825 │  │
│ └───────────────────────────────────────┴───────┴───────────┴──────────────────────────┴─────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

This file has ~3.8M records.

Data Types

  • We’ve got two ID columns that aren’t really continuous: SK_ID_PREV and SK_ID_CURR
  • There’s opportunity the NAME_CONTRACT_STATUS variable to cast as a categorical

Missing Data

  • There are several “count” variables with missing date (~20% missing)
    • CNT_INSTALLMENT_MATURE_CUM, CNT_DRAWINGS_POS_CURRENT, CNT_DRAWINGS_OTHER_CURRENT, CNT_DRAWINGS_ATM_CURRENT
    • I assume much of these are missing given there were now withdrawals for the categories
  • Some “amount” variables corresponding to the same categories above missing similar amounts of data

Distribution

  • CNT_INSTALLMENT_MATURE_CUM is very left skewed (understandable with a cumulative measure)
  • MONTHS_BALANCE is right skewed

Potential feature engineering

  • It’s possible this dataset can be represented in 2 variables:
    • Withdrawal amounts relative to limit
    • Withdrawal counts

Dataset Joins

With a fair understanding of the files we’re working with, let’s explore the central idea to the project: Home Credit wants to improve inclusivity of an underserved population. This group is generally without sufficient (or any) credit history. Let’s see if we can identify those applications via joining the datasets.

No Credit Bureau History

For this look, we need to join application_train.csv and bureau.csv. We’ll then aggregate to find out the distribution of credit history among Home Credit clients.

bureau_history = (
    duckdb.sql("""
        WITH 
        STEP1 AS (
            SELECT 
            at.SK_ID_CURR AS app_id
            ,COUNT(b.SK_ID_CURR) AS bureau_history_cnt

            FROM 'data/application_train.csv' at
            LEFT JOIN 'data/bureau.csv' b ON b.SK_ID_CURR = at.SK_ID_CURR

            GROUP BY ALL
        ) 
        SELECT 
        bureau_history_cnt
        ,COUNT(*) AS app_cnt

        FROM STEP1

        GROUP BY ALL

        ORDER BY bureau_history_cnt
    """).fetchdf()
)

bureau_history.head()
bureau_history_cnt app_cnt
0 0 44020
1 1 36072
2 2 35635
3 3 32925
4 4 28973
Note

I mentioned previously that DuckDB has tremendous syntax and we get our first glimpse in the above code block. Instead of needing to mirror the non-aggregated column names from the SELECT clause down to the GROUP BY clause, using GROUP BY ALL will perform this step for us (documentation). Glorious!

What we’ve done here is isolate the number of applications for which there is no history with the credit credit bureau (44,020 applications). Additionally, there are those with little credit history. How much is little? Let’s generate a summary to get a more true sense of distribution (we know it’s left-skewed) with the describe() method of a Pandas dataframe:

(bureau_history                                                                     # Take the history aggregated
    .assign(
        app_freq = lambda df: df['app_cnt'] / sum(df['app_cnt']),                   # Calculate the percent frequency
        adj_history_cnt = lambda df: df['app_freq'] * df['bureau_history_cnt']      # Calculate adjusted count of history
    )
    ['adj_history_cnt']                                                             # Select the new variable
    .sum()                                                                          # Get the total (our new average)
)
4.765114093479584

Here we learn that the average number of historical credit bureau records is <= 5. This is the average, so insufficien history is far more likely to be fewer than 3 (< 3) records of history. This, we’d probably say, is the definition of insufficient credit history.

Supplemental History

Now that we’ve identified the underserved population this project cares about, let’s see if we can’t confirm there exists other data points that could help assess credit worthiness.

In theory, Home Credit has previous application data on these individuals that may contribute. Let’s expand on the query above to see:

supplement_history = (
    duckdb.sql("""
        WITH 
        STEP1 AS (
            SELECT 
            at.SK_ID_CURR
            ,COUNT(b.SK_ID_CURR) AS bureau_history_cnt

            FROM 'data/application_train.csv' at
            LEFT JOIN 'data/bureau.csv' b ON b.SK_ID_CURR = at.SK_ID_CURR

            GROUP BY ALL
        ) 
        ,STEP2 AS (
            SELECT 
            s.SK_ID_CURR AS ap_id
            ,COUNT(pa.SK_ID_PREV) AS prev_app_history_cnt

            FROM STEP1 s
            LEFT JOIN 'data/previous_application.csv' pa ON pa.SK_ID_CURR = s.SK_ID_CURR

            WHERE 1 = 1
            AND bureau_history_cnt <= 5 -- Insufficient credit history

            GROUP BY ALL
        )
        SELECT
        prev_app_history_cnt
        ,COUNT(*) AS app_cnt

        FROM STEP2

        GROUP BY ALL

        ORDER BY prev_app_history_cnt
    """).fetchdf()
)

supplement_history.head()
prev_app_history_cnt app_cnt
0 0 12547
1 1 39345
2 2 32357
3 3 27080
4 4 21688

Despite most of this underserved population haing some previous application data with Home Credit we could use to supplement insufficient credit bureau, there’s at least 16,237 original applications without any previous history with Home Credit.

(supplement_history                                                                     # Take the history aggregated
    .assign(
        app_freq = lambda df: df['app_cnt'] / sum(df['app_cnt']),                       # Calculate the percent frequency
        adj_history_cnt = lambda df: df['app_freq'] * df['prev_app_history_cnt']        # Calculate adjusted count of history
    )
    ['adj_history_cnt']                                                                 # Select the new variable
    .sum()                                                                              # Get the total (our new average)
)
4.136952766398499

Again, we see the average number of previous application history with Home Credit to be <= 5. We’ll assume the same rule from above (< 3) for the “insufficient” group.

Insufficient History Options

We’ve established a group that has insufficient history from both the credit bureau and historical Home Credit records. The next step is to re-analyze the application_train.csv file only for this group. Any model we design has to be inclusive to this group and leverage what data exists therein.

Here, we’re restricting the data to our “insufficient” group and we’ll skim again:

insufficient_history = (
    duckdb.sql("""
        WITH 
        STEP1 AS (
            SELECT 
            at.SK_ID_CURR
            ,COUNT(DISTINCT b.SK_ID_BUREAU) AS bureau_history_cnt
            ,COUNT(DISTINCT pa.SK_ID_PREV) AS hc_history_cnt

            FROM 'data/application_train.csv' at
            LEFT JOIN 'data/bureau.csv' b ON b.SK_ID_CURR = at.SK_ID_CURR
            LEFT JOIN 'data/previous_application.csv' pa ON pa.SK_ID_CURR = at.SK_ID_CURR

            GROUP BY
            at.SK_ID_CURR
        )
        SELECT
        at.*

        FROM 'data/application_train.csv' at
        INNER JOIN STEP1 ON STEP1.SK_ID_CURR = at.SK_ID_CURR

        WHERE bureau_history_cnt < 3
        AND hc_history_cnt < 3
    """).fetchdf()
)

skim(insufficient_history)
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 51288  │ │ float64     │ 65    │                                                          │
│ │ Number of columns │ 122    │ │ int32       │ 41    │                                                          │
│ └───────────────────┴────────┘ │ string      │ 16    │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_n                                                                                         ┃  │
│ ┃ ame       NA     NA %   mean      sd        p0        p25      p50      p75      p100     hist   ┃  │
│ ┡━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ SK_ID_CU    0    0  277100  102600  100000 188400 276200 365300 456300▇▇▇▇▇▇ │  │
│ │ RR       │       │       │          │          │          │         │         │         │         │        │  │
│ │ TARGET      0    0 0.08981  0.2859       0      0      0      0      1▇    ▁ │  │
│ │ CNT_CHIL    0    0  0.3923  0.7015       0      0      0      1      7  ▇▁   │  │
│ │ DREN     │       │       │          │          │          │         │         │         │         │        │  │
│ │ AMT_INCO    0    0  152900   98990   25650  90000 135000 1800003375000 │  │
│ │ ME_TOTAL │       │       │          │          │          │         │         │         │         │        │  │
│ │ AMT_CRED    0    0  574100  417000   45000 260600 473800 7819004050000  ▇▃   │  │
│ │ IT       │       │       │          │          │          │         │         │         │         │        │  │
│ │ AMT_ANNU    3 0.01   26860   15890    2542  15330  24250  34160 225000  ▇▂   │  │
│ │ ITY      │       │       │          │          │          │         │         │         │         │        │  │
│ │ AMT_GOOD   58 0.11  516800  383700   45000 225000 450000 6750004050000  ▇▂   │  │
│ │ S_PRICE  │       │       │          │          │          │         │         │         │         │        │  │
│ │ REGION_P    0    0 0.02162 0.015050.0005330.010010.018850.028660.07251▇▇▆▁ ▁ │  │
│ │ OPULATIO │       │       │          │          │          │         │         │         │         │        │  │
│ │ N_RELATI │       │       │          │          │          │         │         │         │         │        │  │
│ │ VE       │       │       │          │          │          │         │         │         │         │        │  │
│ │ DAYS_BIR    0    0  -15420    4680  -25230 -19420 -15050 -11330  -7676▃▆▇▇▇▇ │  │
│ │ TH       │       │       │          │          │          │         │         │         │         │        │  │
│ │ DAYS_EMP    0    0   64190  141100  -16840  -2054   -875   -220 365200▇    ▂ │  │
│ │ LOYED    │       │       │          │          │          │         │         │         │         │        │  │
│ │ DAYS_REG    0    0   -4979    3542  -23740  -7531  -4458  -1991      0  ▁▃▆▇ │  │
│ │ ISTRATIO │       │       │          │          │          │         │         │         │         │        │  │
│ │ N        │       │       │          │          │          │         │         │         │         │        │  │
│ │ DAYS_ID_    0    0   -2743    1553   -6235  -4196  -2801  -1322      0 ▇▆▅▆▆ │  │
│ │ PUBLISH  │       │       │          │          │          │         │         │         │         │        │  │
│ │ OWN_CAR_3562869.47   11.96   11.25       0      5      9     15     65 ▇▅▁   │  │
│ │ AGE      │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_MOB    0    0       10.004416       0      1      1      1      1 │  │
│ │ IL       │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_EMP    0    0  0.8199  0.3843       0      1      1      1      1▂    ▇ │  │
│ │ _PHONE   │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_WOR    0    0  0.2177  0.4127       0      0      0      0      1▇    ▂ │  │
│ │ K_PHONE  │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_CON    0    0  0.9971 0.05364       0      1      1      1      1 │  │
│ │ T_MOBILE │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_PHO    0    0  0.2553  0.4361       0      0      0      1      1▇    ▃ │  │
│ │ NE       │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_EMA    0    0  0.0388  0.1931       0      0      0      0      1 │  │
│ │ IL       │       │       │          │          │          │         │         │         │         │        │  │
│ │ CNT_FAM_    2    0   2.085  0.9133       1      1      2      2      9 ▇▂▁   │  │
│ │ MEMBERS  │       │       │          │          │          │         │         │         │         │        │  │
│ │ REGION_R    0    0   2.015  0.5313       1      2      2      2      3▁  ▇ ▂ │  │
│ │ ATING_CL │       │       │          │          │          │         │         │         │         │        │  │
│ │ IENT     │       │       │          │          │          │         │         │         │         │        │  │
│ │ REGION_R    0    0   1.995  0.5245       1      2      2      2      3▂  ▇ ▁ │  │
│ │ ATING_CL │       │       │          │          │          │         │         │         │         │        │  │
│ │ IENT_W_C │       │       │          │          │          │         │         │         │         │        │  │
│ │ ITY      │       │       │          │          │          │         │         │         │         │        │  │
│ │ HOUR_APP    0    0   12.21   3.298       0     10     12     15     23 ▁▇▇▃  │  │
│ │ R_PROCES │       │       │          │          │          │         │         │         │         │        │  │
│ │ S_START  │       │       │          │          │          │         │         │         │         │        │  │
│ │ REG_REGI    0    0 0.02427  0.1539       0      0      0      0      1 │  │
│ │ ON_NOT_L │       │       │          │          │          │         │         │         │         │        │  │
│ │ IVE_REGI │       │       │          │          │          │         │         │         │         │        │  │
│ │ ON       │       │       │          │          │          │         │         │         │         │        │  │
│ │ REG_REGI    0    0 0.06543  0.2473       0      0      0      0      1▇    ▁ │  │
│ │ ON_NOT_W │       │       │          │          │          │         │         │         │         │        │  │
│ │ ORK_REGI │       │       │          │          │          │         │         │         │         │        │  │
│ │ ON       │       │       │          │          │          │         │         │         │         │        │  │
│ │ LIVE_REG    0    0 0.04816  0.2141       0      0      0      0      1 │  │
│ │ ION_NOT_ │       │       │          │          │          │         │         │         │         │        │  │
│ │ WORK_REG │       │       │          │          │          │         │         │         │         │        │  │
│ │ ION      │       │       │          │          │          │         │         │         │         │        │  │
│ │ REG_CITY    0    0 0.09774   0.297       0      0      0      0      1▇    ▁ │  │
│ │ _NOT_LIV │       │       │          │          │          │         │         │         │         │        │  │
│ │ E_CITY   │       │       │          │          │          │         │         │         │         │        │  │
│ │ REG_CITY    0    0  0.2582  0.4377       0      0      0      1      1▇    ▃ │  │
│ │ _NOT_WOR │       │       │          │          │          │         │         │         │         │        │  │
│ │ K_CITY   │       │       │          │          │          │         │         │         │         │        │  │
│ │ LIVE_CIT    0    0  0.1921  0.3939       0      0      0      0      1▇    ▂ │  │
│ │ Y_NOT_WO │       │       │          │          │          │         │         │         │         │        │  │
│ │ RK_CITY  │       │       │          │          │          │         │         │         │         │        │  │
│ │ EXT_SOUR3078960.03  0.4715  0.2159  0.0195 0.2937 0.4636 0.6457 0.9461▃▇▇▇▆▃ │  │
│ │ CE_1     │       │       │          │          │          │         │         │         │         │        │  │
│ │ EXT_SOUR  206  0.4  0.5056  0.19641.316e-0 0.3703 0.5563 0.6623  0.855▁▃▃▅▇▃ │  │
│ │ CE_2     │       │       │          │          │        6 │         │         │         │         │        │  │
│ │ EXT_SOUR2485948.47   0.549  0.19280.000527 0.4084 0.5815 0.7047  0.894▁▂▅▆▇▅ │  │
│ │ CE_3     │       │       │          │          │        3 │         │         │         │         │        │  │
│ │ APARTMEN2702452.69  0.1207  0.1125       0 0.0577 0.0897 0.1485      1  ▇▂   │  │
│ │ TS_AVG   │       │       │          │          │          │         │         │         │         │        │  │
│ │ BASEMENT3087160.19 0.08953 0.08507       0 0.0442 0.0765  0.113      1  ▇▁   │  │
│ │ AREA_AVG │       │       │          │          │          │         │         │         │         │        │  │
│ │ YEARS_BE2597250.64  0.9777 0.05906       0 0.9767 0.9819 0.9871      1 │  │
│ │ GINEXPLU │       │       │          │          │          │         │         │         │         │        │  │
│ │ ATATION_ │       │       │          │          │          │         │         │         │         │        │  │
│ │ AVG      │       │       │          │          │          │         │         │         │         │        │  │
│ │ YEARS_BU3481967.89  0.7536  0.1137       0 0.6872 0.7552 0.8232      1   ▂▇▃ │  │
│ │ ILD_AVG  │       │       │          │          │          │         │         │         │         │        │  │
│ │ COMMONAR3653071.23 0.04638 0.08186       0 0.0077 0.0211 0.0527      1 │  │
│ │ EA_AVG   │       │       │          │          │          │         │         │         │         │        │  │
│ │ ELEVATOR2818854.96 0.08437  0.1405       0      0      0   0.14      1  ▇▁   │  │
│ │ S_AVG    │       │       │          │          │          │         │         │         │         │        │  │
│ │ ENTRANCE2675952.17  0.1494  0.1016       0  0.069 0.1379 0.2069      1  ▇▃   │  │
│ │ S_AVG    │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLOORSMA2647651.62  0.2335  0.1534       0 0.1667 0.1667 0.3333      1 ▃▇▁   │  │
│ │ X_AVG    │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLOORSMI3549469.21  0.2386  0.1691       0 0.0833 0.2083  0.375      1▆▇▅▁▁  │  │
│ │ N_AVG    │       │       │          │          │          │         │         │         │         │        │  │
│ │ LANDAREA3130561.04 0.06611 0.08361       0 0.0173 0.0473 0.0854      1 │  │
│ │ _AVG     │       │       │          │          │          │         │         │         │         │        │  │
│ │ LIVINGAP3574169.69  0.1044 0.09666       0 0.0504 0.0756 0.1275      1  ▇▁   │  │
│ │ ARTMENTS │       │       │          │          │          │         │         │         │         │        │  │
│ │ _AVG     │       │       │          │          │          │         │         │         │         │        │  │
│ │ LIVINGAR2669452.05  0.1109  0.1151       0 0.0453 0.0756 0.1385      1  ▇▁   │  │
│ │ EA_AVG   │       │       │          │          │          │         │         │         │         │        │  │
│ │ NONLIVIN3627170.720.009464 0.05139       0      0      0 0.0053      1 │  │
│ │ GAPARTME │       │       │          │          │          │         │         │         │         │        │  │
│ │ NTS_AVG  │       │       │          │          │          │         │         │         │         │        │  │
│ │ NONLIVIN2920056.93 0.03025 0.07467       0      0  0.004 0.0301      1 │  │
│ │ GAREA_AV │       │       │          │          │          │         │         │         │         │        │  │
│ │ G        │       │       │          │          │          │         │         │         │         │        │  │
│ │ APARTMEN2702452.69  0.1171  0.1119       0 0.0525  0.084 0.1492      1  ▇▂   │  │
│ │ TS_MODE  │       │       │          │          │          │         │         │         │         │        │  │
│ │ BASEMENT3087160.19  0.0883 0.08704       0   0.04 0.0747 0.1132      1  ▇▁   │  │
│ │ AREA_MOD │       │       │          │          │          │         │         │         │         │        │  │
│ │ E        │       │       │          │          │          │         │         │         │         │        │  │
│ │ YEARS_BE2597250.64  0.9771 0.06404       0 0.9767 0.9816 0.9866      1 │  │
│ │ GINEXPLU │       │       │          │          │          │         │         │         │         │        │  │
│ │ ATATION_ │       │       │          │          │          │         │         │         │         │        │  │
│ │ MODE     │       │       │          │          │          │         │         │         │         │        │  │
│ │ YEARS_BU3481967.89  0.7607  0.1104       0 0.6994 0.7648 0.8301      1   ▂▇▃ │  │
│ │ ILD_MODE │       │       │          │          │          │         │         │         │         │        │  │
│ │ COMMONAR3653071.23 0.04395 0.08004       0 0.0069 0.0191 0.0497      1 │  │
│ │ EA_MODE  │       │       │          │          │          │         │         │         │         │        │  │
│ │ ELEVATOR2818854.96 0.07932  0.1374       0      0      0 0.1208      1  ▇▁   │  │
│ │ S_MODE   │       │       │          │          │          │         │         │         │         │        │  │
│ │ ENTRANCE2675952.17  0.1445  0.1025       0  0.069 0.1379 0.2069      1  ▇▃   │  │
│ │ S_MODE   │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLOORSMA2647651.62  0.2289  0.1518       0 0.1667 0.1667 0.3333      1 ▃▇▁   │  │
│ │ X_MODE   │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLOORSMI3549469.21  0.2341  0.1689       0 0.0833 0.2083  0.375      1▆▇▅▁▁  │  │
│ │ N_MODE   │       │       │          │          │          │         │         │         │         │        │  │
│ │ LANDAREA3130561.04 0.06452 0.08361       00.01515 0.0447 0.0834      1 │  │
│ │ _MODE    │       │       │          │          │          │         │         │         │         │        │  │
│ │ LIVINGAP3574169.69  0.1093  0.1023       0 0.0542  0.079 0.1322      1  ▇▁   │  │
│ │ ARTMENTS │       │       │          │          │          │         │         │         │         │        │  │
│ │ _MODE    │       │       │          │          │          │         │         │         │         │        │  │
│ │ LIVINGAR2669452.05  0.1092  0.1163       00.04262 0.0735  0.131      1  ▇▁   │  │
│ │ EA_MODE  │       │       │          │          │          │         │         │         │         │        │  │
│ │ NONLIVIN3627170.72 0.00863 0.04937       0      0      0 0.0039      1 │  │
│ │ GAPARTME │       │       │          │          │          │         │         │         │         │        │  │
│ │ NTS_MODE │       │       │          │          │          │         │         │         │         │        │  │
│ │ NONLIVIN2920056.93 0.02863 0.07492       0      0 0.0013 0.0247      1 │  │
│ │ GAREA_MO │       │       │          │          │          │         │         │         │         │        │  │
│ │ DE       │       │       │          │          │          │         │         │         │         │        │  │
│ │ APARTMEN2702452.69  0.1211  0.1133       0 0.05830.08875 0.1499      1  ▇▂   │  │
│ │ TS_MEDI  │       │       │          │          │          │         │         │         │         │        │  │
│ │ BASEMENT3087160.19 0.08905 0.08487       0 0.0438 0.0761 0.1126      1  ▇▁   │  │
│ │ AREA_MED │       │       │          │          │          │         │         │         │         │        │  │
│ │ I        │       │       │          │          │          │         │         │         │         │        │  │
│ │ YEARS_BE2597250.64  0.9777 0.05961       0 0.9767 0.9816 0.9871      1 │  │
│ │ GINEXPLU │       │       │          │          │          │         │         │         │         │        │  │
│ │ ATATION_ │       │       │          │          │          │         │         │         │         │        │  │
│ │ MEDI     │       │       │          │          │          │         │         │         │         │        │  │
│ │ YEARS_BU3481967.89  0.7569  0.1125       0 0.6914 0.7585 0.8256      1   ▂▇▃ │  │
│ │ ILD_MEDI │       │       │          │          │          │         │         │         │         │        │  │
│ │ COMMONAR3653071.23  0.0463 0.08155       0 0.0076 0.02090.05267      1 │  │
│ │ EA_MEDI  │       │       │          │          │          │         │         │         │         │        │  │
│ │ ELEVATOR2818854.96 0.08353  0.1405       0      0      0   0.14      1  ▇▁   │  │
│ │ S_MEDI   │       │       │          │          │          │         │         │         │         │        │  │
│ │ ENTRANCE2675952.17   0.149  0.1019       0  0.069 0.1379 0.2069      1  ▇▃   │  │
│ │ S_MEDI   │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLOORSMA2647651.62  0.2332  0.1537       0 0.1667 0.1667 0.3333      1 ▃▇▁   │  │
│ │ X_MEDI   │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLOORSMI3549469.21  0.2383  0.1698       0 0.0833 0.2083  0.375      1▆▇▅▁▁  │  │
│ │ N_MEDI   │       │       │          │          │          │         │         │         │         │        │  │
│ │ LANDAREA3130561.04 0.06697 0.08485       0 0.0173 0.0477 0.0866      1  ▇▁   │  │
│ │ _MEDI    │       │       │          │          │          │         │         │         │         │        │  │
│ │ LIVINGAP3574169.69  0.1056 0.09756       0 0.0513  0.077 0.1274      1  ▇▁   │  │
│ │ ARTMENTS │       │       │          │          │          │         │         │         │         │        │  │
│ │ _MEDI    │       │       │          │          │          │         │         │         │         │        │  │
│ │ LIVINGAR2669452.05  0.1122   0.117       0 0.0458 0.0763 0.1389      1  ▇▁   │  │
│ │ EA_MEDI  │       │       │          │          │          │         │         │         │         │        │  │
│ │ NONLIVIN3627170.720.009239 0.05063       0      0      0 0.0039      1 │  │
│ │ GAPARTME │       │       │          │          │          │         │         │         │         │        │  │
│ │ NTS_MEDI │       │       │          │          │          │         │         │         │         │        │  │
│ │ NONLIVIN2920056.93 0.03012 0.07527       0      0 0.0035  0.029      1 │  │
│ │ GAREA_ME │       │       │          │          │          │         │         │         │         │        │  │
│ │ DI       │       │       │          │          │          │         │         │         │         │        │  │
│ │ TOTALARE2571350.13  0.1057  0.1124       0 0.0412 0.0695 0.1361      1  ▇▂   │  │
│ │ A_MODE   │       │       │          │          │          │         │         │         │         │        │  │
│ │ OBS_30_C  331 0.65   1.276   2.167       0      0      0      2     27  ▇▁   │  │
│ │ NT_SOCIA │       │       │          │          │          │         │         │         │         │        │  │
│ │ L_CIRCLE │       │       │          │          │          │         │         │         │         │        │  │
│ │ DEF_30_C  331 0.65  0.1326  0.4237       0      0      0      0      6  ▇▁   │  │
│ │ NT_SOCIA │       │       │          │          │          │         │         │         │         │        │  │
│ │ L_CIRCLE │       │       │          │          │          │         │         │         │         │        │  │
│ │ OBS_60_C  331 0.65    1.26   2.147       0      0      0      2     27  ▇▁   │  │
│ │ NT_SOCIA │       │       │          │          │          │         │         │         │         │        │  │
│ │ L_CIRCLE │       │       │          │          │          │         │         │         │         │        │  │
│ │ DEF_60_C  331 0.65 0.09169  0.3424       0      0      0      0      5  ▇▁   │  │
│ │ NT_SOCIA │       │       │          │          │          │         │         │         │         │        │  │
│ │ L_CIRCLE │       │       │          │          │          │         │         │         │         │        │  │
│ │ DAYS_LAS    1    0  -660.1     750   -4292   -876   -436    -47      0  ▁▁▂▇ │  │
│ │ T_PHONE_ │       │       │          │          │          │         │         │         │         │        │  │
│ │ CHANGE   │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    01.95e-050.004416       0      0      0      0      1 │  │
│ │ UMENT_2  │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    0  0.6386  0.4804       0      0      1      1      1▅    ▇ │  │
│ │ UMENT_3  │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    05.849e-00.007648       0      0      0      0      1 │  │
│ │ UMENT_4  │       │       │        5 │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    0 0.01538  0.1231       0      0      0      0      1 │  │
│ │ UMENT_5  │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    0 0.09546  0.2939       0      0      0      0      1▇    ▁ │  │
│ │ UMENT_6  │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    00.000370 0.01924       0      0      0      0      1 │  │
│ │ UMENT_7  │       │       │        5 │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    0 0.08343  0.2765       0      0      0      0      1▇    ▁ │  │
│ │ UMENT_8  │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    00.004835 0.06937       0      0      0      0      1 │  │
│ │ UMENT_9  │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    0       0       0       0      0      0      0      0 │  │
│ │ UMENT_10 │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    0 0.01037  0.1013       0      0      0      0      1 │  │
│ │ UMENT_11 │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    0       0       0       0      0      0      0      0 │  │
│ │ UMENT_12 │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    00.004972 0.07034       0      0      0      0      1 │  │
│ │ UMENT_13 │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    00.004738 0.06867       0      0      0      0      1 │  │
│ │ UMENT_14 │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    0 0.00195 0.04411       0      0      0      0      1 │  │
│ │ UMENT_15 │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    0 0.01503  0.1217       0      0      0      0      1 │  │
│ │ UMENT_16 │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    0 0.00039 0.01974       0      0      0      0      1 │  │
│ │ UMENT_17 │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    0 0.01322  0.1142       0      0      0      0      1 │  │
│ │ UMENT_18 │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    00.000799 0.02826       0      0      0      0      1 │  │
│ │ UMENT_19 │       │       │        4 │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    00.000273 0.01652       0      0      0      0      1 │  │
│ │ UMENT_20 │       │       │          │          │          │         │         │         │         │        │  │
│ │ FLAG_DOC    0    00.000682 0.02611       0      0      0      0      1 │  │
│ │ UMENT_21 │       │       │        4 │          │          │         │         │         │         │        │  │
│ │ AMT_REQ_1859136.25 0.00682   0.087       0      0      0      0      3 │  │
│ │ CREDIT_B │       │       │          │          │          │         │         │         │         │        │  │
│ │ UREAU_HO │       │       │          │          │          │         │         │         │         │        │  │
│ │ UR       │       │       │          │          │          │         │         │         │         │        │  │
│ │ AMT_REQ_1859136.250.008227  0.1321       0      0      0      0      8 │  │
│ │ CREDIT_B │       │       │          │          │          │         │         │         │         │        │  │
│ │ UREAU_DA │       │       │          │          │          │         │         │         │         │        │  │
│ │ Y        │       │       │          │          │          │         │         │         │         │        │  │
│ │ AMT_REQ_1859136.25  0.0259  0.1865       0      0      0      0      6 │  │
│ │ CREDIT_B │       │       │          │          │          │         │         │         │         │        │  │
│ │ UREAU_WE │       │       │          │          │          │         │         │         │         │        │  │
│ │ EK       │       │       │          │          │          │         │         │         │         │        │  │
│ │ AMT_REQ_1859136.25  0.1259  0.5168       0      0      0      0     15 │  │
│ │ CREDIT_B │       │       │          │          │          │         │         │         │         │        │  │
│ │ UREAU_MO │       │       │          │          │          │         │         │         │         │        │  │
│ │ N        │       │       │          │          │          │         │         │         │         │        │  │
│ │ AMT_REQ_1859136.25  0.1567  0.4538       0      0      0      0     19 │  │
│ │ CREDIT_B │       │       │          │          │          │         │         │         │         │        │  │
│ │ UREAU_QR │       │       │          │          │          │         │         │         │         │        │  │
│ │ T        │       │       │          │          │          │         │         │         │         │        │  │
│ │ AMT_REQ_1859136.25  0.7666  0.9851       0      0      0      1      9 ▇▁▁   │  │
│ │ CREDIT_B │       │       │          │          │          │         │         │         │         │        │  │
│ │ UREAU_YE │       │       │          │          │          │         │         │         │         │        │  │
│ │ AR       │       │       │          │          │          │         │         │         │         │        │  │
│ └──────────┴───────┴───────┴──────────┴──────────┴──────────┴─────────┴─────────┴─────────┴─────────┴────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                         NA          NA %        words per row           total words        ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩  │
│ │ NAME_CONTRACT_TYPE                         0         0                     2            102576 │  │
│ │ CODE_GENDER                                0         0                     1             51288 │  │
│ │ FLAG_OWN_CAR                               0         0                     1             51288 │  │
│ │ FLAG_OWN_REALTY                            0         0                     1             51288 │  │
│ │ NAME_TYPE_SUITE                          320      0.62                     1             53444 │  │
│ │ NAME_INCOME_TYPE                           0         0                   1.3             66926 │  │
│ │ NAME_EDUCATION_TYPE                        0         0                   3.4            174134 │  │
│ │ NAME_FAMILY_STATUS                         0         0                   1.7             87595 │  │
│ │ NAME_HOUSING_TYPE                          0         0                   2.9            147099 │  │
│ │ OCCUPATION_TYPE                        16214     31.61                   1.1             57171 │  │
│ │ WEEKDAY_APPR_PROCESS_START                 0         0                     1             51288 │  │
│ │ ORGANIZATION_TYPE                          0         0                   2.1            107715 │  │
│ │ FONDKAPREMONT_MODE                     35708     69.62                  0.93             47729 │  │
│ │ HOUSETYPE_MODE                         26685     52.03                   1.4             73398 │  │
│ │ WALLSMATERIAL_MODE                     27053     52.75                  0.68             34667 │  │
│ │ EMERGENCYSTATE_MODE                    25298     49.33                  0.51             25990 │  │
│ └────────────────────────────────────┴────────────┴────────────┴────────────────────────┴────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

This leaves us with 51,288 applications in the “insufficient history” category. This means we’re left with the columns on the application_train.csv file as the best souce for predicting default. Let’s get some quick observations on this subset of applications:

  • Just as before, most items relating to a dwelling have a high rate of missing data
    • I’m curious to see if these data points are even helpful or not
    • There’s approximately 43 dwelling related predictors; if we take these and reduce to its principal components, we can test if there’s anything of predictive value in here
  • Mean income and variance thereof is lower in this sample versus the entire application_train.csv file, while credit amount and other measures are fairly consistent
    • I’m getting the impression that these applications are, in fact, very similar to the population
    • As the prompt said originally, due to insufficient credit history, these often falling victim to overly punitive terms by lenders

Underserved Applications

The crux of this problem is to improve the borrowing experience for this underserved group of applications. Therefore, if we find a pretty good model for this group, we’re likely to be creating a good model overall. Let’s dive into some predictive exploration for this group.

Documents

There’s approximately 20 FLAG_DOCUMENT_# features in the dataset. Let’s see if these are even valuable to us. For that, we’ll reduce to its principal components and then make a simple model.

We’ll load the libraries necessary for PCA:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Now let’s first get our data in the right format. We want 1) all of the FLAG_DOCUMENT_# columns in one dataframe and 2) TARGET in its own.

app_docs = insufficient_history.filter(like='FLAG_DOCUMENT_')       # Extract all FLAG_DOCUMENT variables
app_docs_scaled = StandardScaler().fit_transform(app_docs)          # Standardize each column
app_target = insufficient_history[['TARGET']]                       # Get the target variable

Now we can reduce these 20 variables to its principal components. Hopefully, a small fraction of those will explain 80-90% of the variance. This would allow us to use fewer variables in a test model. Either way, we’ll then be able to run a simple model to see how predictive these are.

import random
random.seed(814)

pca = PCA(n_components=app_docs_scaled.shape[1])        # Define the PCA
app_docs_pca = pca.fit_transform(app_docs_scaled)       # Run the PCA

exp_var = pca.explained_variance_ratio_                 # Get the explained variance for each principal component

print(exp_var[0:4])                                     # How much variance is explained by the first 5 components 
[0.08759671 0.06570449 0.05878099 0.0574298 ]

These first 5 components are only explaining about 1/3 of the variance. PCA isn’t helping us reduce the volume of variables. Instead, let’s just run a simple, logistic regression model with all these predictors modeling TARGET. I really like the statsmodels package as it facilitates very similar model summaries to R, which is huge for interpretability.

import statsmodels.api as sm

app_docs_mod = sm.Logit(app_target, app_docs_pca).fit()     # Fit a logistic regression

print(app_docs_mod.summary())                               # Print the model summary
Optimization terminated successfully.
         Current function value: 0.692194
         Iterations 4
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 TARGET   No. Observations:                51288
Model:                          Logit   Df Residuals:                    51270
Method:                           MLE   Df Model:                           17
Date:                Fri, 11 Oct 2024   Pseudo R-squ.:                  -1.291
Time:                        06:27:32   Log-Likelihood:                -35501.
converged:                       True   LL-Null:                       -15494.
Covariance Type:            nonrobust   LLR p-value:                     1.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.0515      0.044      1.174      0.240      -0.035       0.138
x2            -0.0101     97.477     -0.000      1.000    -191.062     191.041
x3            -0.0101   1.16e+04  -8.71e-07      1.000   -2.26e+04    2.26e+04
x4            -0.0128   1.02e+05  -1.26e-07      1.000   -1.99e+05    1.99e+05
x5             0.0159    2.5e+04   6.37e-07      1.000   -4.89e+04    4.89e+04
x6            -0.0073   8.27e+04  -8.86e-08      1.000   -1.62e+05    1.62e+05
x7             0.0018    1.5e+05   1.17e-08      1.000   -2.95e+05    2.95e+05
x8            -0.0107   9.87e+04  -1.08e-07      1.000   -1.94e+05    1.94e+05
x9            -0.0062   1.02e+05  -6.08e-08      1.000      -2e+05       2e+05
x10           -0.0043   1.26e+05  -3.42e-08      1.000   -2.46e+05    2.46e+05
x11            0.0071    4.7e+04   1.51e-07      1.000   -9.22e+04    9.22e+04
x12           -0.0067   9.61e+04  -6.98e-08      1.000   -1.88e+05    1.88e+05
x13           -0.0030   7275.912  -4.07e-07      1.000   -1.43e+04    1.43e+04
x14            0.0108   8.35e+04    1.3e-07      1.000   -1.64e+05    1.64e+05
x15            0.0069   1.91e+04   3.59e-07      1.000   -3.75e+04    3.75e+04
x16           -0.0154   6661.497  -2.31e-06      1.000   -1.31e+04    1.31e+04
x17           -0.0246     82.629     -0.000      1.000    -161.974     161.925
x18            0.0732      0.017      4.417      0.000       0.041       0.106
x19         1.401e-18    1.3e+21   1.08e-39      1.000   -2.55e+21    2.55e+21
x20          2.05e-18   7.07e+20    2.9e-39      1.000   -1.39e+21    1.39e+21
==============================================================================

Clearly, this is a very bad model. We’ve got a negative \(R^2\) and only two of the components are even statistically significant (x1 and x18). We could sum the PCA explained variance from above for those features:

exp_var[0] + exp_var[17]        # Explained variance by the only two components that were statistically significant
0.10376992731867007

Only 10% of the variance is being explained by these statistically significant components. As is, these won’t be helpful in predicting our target variable. It’s possible a simple field indicating the count of documents submitted is somewhat helpful. But we’ll look at that later.

Dwelling

Let’s use a similar exercise on the variables related to dwelling. We have several versions with different measures of central tendency (average, median, and mode). Let’s go with the median versions. We know there’s a lot of missing values. Let’s get rid of those while we explore the predictive power of these variables.

app_dwell = insufficient_history.filter(like='_MEDI')       # Extract all the "median" variables related to dwelling
missing_idx = app_dwell.isnull().any(axis=1)                # Get index for missing values

app_dwell = app_dwell.loc[-missing_idx,:]                 # Remove missing rows from the dwelling variables
app_target_dwell = app_target.loc[-missing_idx,:]         # Remove missing rows from the target

app_dwell.shape
(12969, 14)

We’re left with 14 columns and no missing values. We don’t need to normalize these values since they came to us already scaled. If this complete version of dwelling variables isn’t decently predictive of TARGET, there’s no good argument for including these in some way in a larger model.

Let’s see if these can be consolidated with PCA.

pca = PCA(n_components=app_dwell.shape[1])              # Define the PCA
app_docs_pca = pca.fit_transform(app_dwell)             # Run the PCA

dwell_exp_var = pca.explained_variance_ratio_           # Get the explained variance for each principal component

print(dwell_exp_var[0:5])                               # How much variance is explained by the first 5 components 
[0.57225323 0.16350268 0.05873402 0.04947089 0.02826078]

Okay! The PCA is helping quite a bit here. We’ve got 87% of the variance explained in the first 5 components. That’s great info!

This would be a great time for a visualization and showing off the power of plotnine, a grammar of graphics visualization package by the same folks who designed ggplot2 for R. Let’s visualize the proportion of variance captured by each principal component.

First, we need to make a dataframe with the principal components we have in a list above.

dwell_pca_df = (
    pd.DataFrame({
        "component": list(map(lambda x: f'x{x}', range(1, 15))), 
        "variance": dwell_exp_var
    })
    .assign(variance_fmt = lambda df: df["variance"].map(lambda x: f'{x * 100:.1f}%'))
    .sort_values("variance", ascending = False)
    .reset_index(drop = True)
)

component_list = dwell_pca_df["component"]
component_cat = pd.Categorical(dwell_pca_df["component"], categories = component_list)
dwell_pca_df = dwell_pca_df.assign(component = component_cat)

dwell_pca_df
component variance variance_fmt
0 x1 0.572253 57.2%
1 x2 0.163503 16.4%
2 x3 0.058734 5.9%
3 x4 0.049471 4.9%
4 x5 0.028261 2.8%
5 x6 0.026983 2.7%
6 x7 0.025260 2.5%
7 x8 0.020493 2.0%
8 x9 0.016817 1.7%
9 x10 0.014299 1.4%
10 x11 0.011477 1.1%
11 x12 0.006752 0.7%
12 x13 0.003870 0.4%
13 x14 0.001827 0.2%
from plotnine import ggplot, geom_col, geom_text, aes, labs, theme_minimal, theme, element_blank, element_text

(
    ggplot(dwell_pca_df, aes("component", "variance", label = "variance_fmt"))
    + geom_col(fill = "#BE0000")
    + geom_text(va = "bottom")
    + labs(
        title = "Proportion of variance by principal component", 
        subtitle = "Dwelling related variables", 
        x = "Component"
    )
    + theme_minimal()
    + theme(
        panel_grid_major_x = element_blank(), 
        panel_grid_major_y = element_blank(), 
        axis_title_y = element_blank(), 
        axis_title_x = element_text(margin = {"t": 20,"r": 0,"b": 0,"l": 0,"units": "pt"}), 
        axis_text_y = element_blank()
    )
)

Instead of using the principal components as the predictors, let’s use the original variables in a logistic regression model (I’m curious which, if any, are predictive of TARGET).

app_dwell_mod = sm.Logit(app_target_dwell, app_dwell).fit()        # Fit a logistic regression

print(app_dwell_mod.summary())                                     # Print the model summary
Optimization terminated successfully.
         Current function value: 0.257421
         Iterations 8
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 TARGET   No. Observations:                12969
Model:                          Logit   Df Residuals:                    12955
Method:                           MLE   Df Model:                           13
Date:                Fri, 11 Oct 2024   Pseudo R-squ.:                 0.01294
Time:                        06:27:34   Log-Likelihood:                -3338.5
converged:                       True   LL-Null:                       -3382.2
Covariance Type:            nonrobust   LLR p-value:                 4.195e-13
================================================================================================
                                   coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
APARTMENTS_MEDI                 -0.4194      1.160     -0.362      0.718      -2.693       1.854
BASEMENTAREA_MEDI                0.2009      0.679      0.296      0.767      -1.130       1.532
YEARS_BEGINEXPLUATATION_MEDI    -2.1403      0.267     -8.022      0.000      -2.663      -1.617
YEARS_BUILD_MEDI                 0.2295      0.374      0.614      0.539      -0.503       0.962
COMMONAREA_MEDI                 -0.0328      0.619     -0.053      0.958      -1.247       1.181
ELEVATORS_MEDI                  -0.6470      0.624     -1.038      0.299      -1.869       0.575
ENTRANCES_MEDI                  -1.3946      0.594     -2.348      0.019      -2.559      -0.231
FLOORSMAX_MEDI                  -2.0031      0.523     -3.828      0.000      -3.029      -0.977
FLOORSMIN_MEDI                  -0.3427      0.346     -0.990      0.322      -1.021       0.336
LANDAREA_MEDI                    0.7573      0.493      1.537      0.124      -0.208       1.723
LIVINGAPARTMENTS_MEDI            0.1574      1.033      0.152      0.879      -1.867       2.182
LIVINGAREA_MEDI                  1.5549      0.940      1.654      0.098      -0.288       3.398
NONLIVINGAPARTMENTS_MEDI        -1.6355      1.460     -1.120      0.263      -4.496       1.225
NONLIVINGAREA_MEDI              -0.7401      0.750     -0.987      0.324      -2.211       0.730
================================================================================================

At first blush, none of these variables, with exception of FLOORSMAX_MEDI and BASEMENTAREA_MEDI, are any good. The \(R^2\) value is poor and most every variable has too high of a p-value. If you remember, 40-71% of these dwelling variables were missing anyway. It’s probably best to exclude these entirely from future models.

It is possible, however, that a simple field for flagging what % of dwelling data is available would be insightful. We’ll look into that later.

Categorical variables

Let’s turn our attention to some of the discrete fields. For these, we’ll want to evaluate the proportion of TARGET that span the categories. For example, if the same proportion of men default as women, gender may not, on its own, have much predictive power. The Chi-Square test would be helpful for this, too.

Let’s design a function, where we feed the data, a variable name, and we get back the results of the test:

import numpy as np

def run_chi_sq(data, var):
    # Get and print the cross tab
    data_crosstab = pd.crosstab(data[var], data["TARGET"])
    data_crosstab_prop = pd.crosstab(data[var], data["TARGET"], normalize='index')
    print(data_crosstab_prop)

    # Run the chi-square test
    observed_vals = data_crosstab.to_numpy()
    observed_table = sm.stats.Table(observed_vals)

    chi_sq_stat = observed_table.test_nominal_association()

    # Print chi-square test results
    print("\n-----------------------------------")
    print(f"Chi-square statistic: {chi_sq_stat.statistic}")
    print(f"P-Value: {chi_sq_stat.pvalue}")

This is also just a good way for us to get a sense for the unique values each of these fields entail, since skimpy hasn’t given us insight into that.

NAME_CONTRACT_TYPE

run_chi_sq(app_train_df, "NAME_CONTRACT_TYPE")
TARGET                     0         1
NAME_CONTRACT_TYPE                    
Cash loans          0.916541  0.083459
Revolving loans     0.945217  0.054783

-----------------------------------
Chi-square statistic: 293.53682101554654
P-Value: 0.0

CODE_GENDER

run_chi_sq(app_train_df, "CODE_GENDER")
TARGET              0         1
CODE_GENDER                    
F            0.930007  0.069993
M            0.898581  0.101419
XNA          1.000000  0.000000

-----------------------------------
Chi-square statistic: 920.4804789807738
P-Value: 0.0

FLAG_OWN_CAR

run_chi_sq(app_train_df, "FLAG_OWN_CAR")
TARGET               0         1
FLAG_OWN_CAR                    
N             0.914998  0.085002
Y             0.927563  0.072437

-----------------------------------
Chi-square statistic: 146.82528196185706
P-Value: 0.0

FLAG_OWN_REALTY

run_chi_sq(app_train_df, "FLAG_OWN_REALTY")
TARGET                  0         1
FLAG_OWN_REALTY                    
N                0.916751  0.083249
Y                0.920384  0.079616

-----------------------------------
Chi-square statistic: 11.624737274340792
P-Value: 0.0006508043252171358

NAME_INCOME_TYPE

run_chi_sq(app_train_df, "NAME_INCOME_TYPE")
TARGET                       0         1
NAME_INCOME_TYPE                        
Businessman           1.000000  0.000000
Commercial associate  0.925157  0.074843
Maternity leave       0.600000  0.400000
Pensioner             0.946136  0.053864
State servant         0.942450  0.057550
Student               1.000000  0.000000
Unemployed            0.636364  0.363636
Working               0.904115  0.095885

-----------------------------------
Chi-square statistic: 1251.8435948609447
P-Value: 0.0

NAME_EDUCATION_TYPE

run_chi_sq(app_train_df, "NAME_EDUCATION_TYPE")
TARGET                                0         1
NAME_EDUCATION_TYPE                              
Academic degree                0.981707  0.018293
Higher education               0.946449  0.053551
Incomplete higher              0.915150  0.084850
Lower secondary                0.890723  0.109277
Secondary / secondary special  0.910601  0.089399

-----------------------------------
Chi-square statistic: 1019.2131873088356
P-Value: 0.0

NAME_FAMILY_STATUS

run_chi_sq(app_train_df, "NAME_FAMILY_STATUS")
TARGET                       0         1
NAME_FAMILY_STATUS                      
Civil marriage        0.900554  0.099446
Married               0.924401  0.075599
Separated             0.918058  0.081942
Single / not married  0.901923  0.098077
Unknown               1.000000  0.000000
Widow                 0.941758  0.058242

-----------------------------------
Chi-square statistic: 504.98913222135616
P-Value: 0.0

NAME_HOUSING_TYPE

run_chi_sq(app_train_df, "NAME_HOUSING_TYPE")
TARGET                      0         1
NAME_HOUSING_TYPE                      
Co-op apartment      0.920677  0.079323
House / apartment    0.922043  0.077957
Municipal apartment  0.914603  0.085397
Office apartment     0.934276  0.065724
Rented apartment     0.876869  0.123131
With parents         0.883019  0.116981

-----------------------------------
Chi-square statistic: 420.55618983894647
P-Value: 0.0

ORGANIZATION_TYPE

run_chi_sq(app_train_df, "ORGANIZATION_TYPE")
TARGET                         0         1
ORGANIZATION_TYPE                         
Advertising             0.918415  0.081585
Agriculture             0.895273  0.104727
Bank                    0.948145  0.051855
Business Entity Type 1  0.918616  0.081384
Business Entity Type 2  0.914716  0.085284
Business Entity Type 3  0.907004  0.092996
Cleaning                0.888462  0.111538
Construction            0.883202  0.116798
Culture                 0.944591  0.055409
Electricity             0.933684  0.066316
Emergency               0.928571  0.071429
Government              0.930219  0.069781
Hotel                   0.935818  0.064182
Housing                 0.920554  0.079446
Industry: type 1        0.889317  0.110683
Industry: type 10       0.935780  0.064220
Industry: type 11       0.913462  0.086538
Industry: type 12       0.962060  0.037940
Industry: type 13       0.865672  0.134328
Industry: type 2        0.927948  0.072052
Industry: type 3        0.893838  0.106162
Industry: type 4        0.898518  0.101482
Industry: type 5        0.931553  0.068447
Industry: type 6        0.928571  0.071429
Industry: type 7        0.919663  0.080337
Industry: type 8        0.875000  0.125000
Industry: type 9        0.933195  0.066805
Insurance               0.943049  0.056951
Kindergarten            0.929651  0.070349
Legal Services          0.921311  0.078689
Medicine                0.934155  0.065845
Military                0.948747  0.051253
Mobile                  0.908517  0.091483
Other                   0.923575  0.076425
Police                  0.950021  0.049979
Postal                  0.915624  0.084376
Realtor                 0.893939  0.106061
Religion                0.941176  0.058824
Restaurant              0.882938  0.117062
School                  0.940852  0.059148
Security                0.900216  0.099784
Security Ministries     0.951368  0.048632
Self-employed           0.898261  0.101739
Services                0.933968  0.066032
Telecom                 0.923744  0.076256
Trade: type 1           0.910920  0.089080
Trade: type 2           0.930000  0.070000
Trade: type 3           0.896621  0.103379
Trade: type 4           0.968750  0.031250
Trade: type 5           0.938776  0.061224
Trade: type 6           0.954041  0.045959
Trade: type 7           0.905504  0.094496
Transport: type 1       0.955224  0.044776
Transport: type 2       0.921960  0.078040
Transport: type 3       0.842460  0.157540
Transport: type 4       0.907188  0.092812
University              0.951017  0.048983
XNA                     0.946004  0.053996

-----------------------------------
Chi-square statistic: 1609.2406359645197
P-Value: 0.0

Predictably, given the sample size, these are all statistically significant proportions of TARGET across the categories.

The most interesting, and likely predictive, is NAME_INCOME_TYPE. Naturally, one’s propensity to default on a loan is often tied to their funds flow. Still, values like Student never defaulting is fascinating.

ORGANIZATION_TYPE is fascinating. There’s a lot of of values here which isn’t ideal for a model. Looking at the values, there’s clearly some that are similar and others that aren’t. Could be a good opportunity for a clustering algorithm.

Clustering organizations

Let’s first prepare the data. Let’s generate two features for each category:

  • Proportion of observations mapped to default (TARGET value of 1)
  • Number of observatiosn total

That should be a decent input to a simple algorithm like Nearest Neighbors.

org_summary = duckdb.sql("""
    SELECT 
    ORGANIZATION_TYPE
    ,SUM(TARGET) / COUNT(*) AS PROP_DEFAULT
    ,COUNT(*) AS TOTAL_OBS
    FROM app_train_df
    GROUP BY
    ORGANIZATION_TYPE
""").fetchdf()

org_summary.head()
ORGANIZATION_TYPE PROP_DEFAULT TOTAL_OBS
0 Other 0.076425 16683
1 Trade: type 7 0.094496 7831
2 Medicine 0.065845 11193
3 Construction 0.116798 6721
4 Services 0.066032 1575

Let’s now construct the nearest neighbors algorithm and apply it to these measures.

from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

org_X = org_summary.loc[:,["PROP_DEFAULT", "TOTAL_OBS"]].to_numpy()
knn_graph = kneighbors_graph(org_X, 5, include_self=False)

agg_clustering = AgglomerativeClustering(n_clusters=5, connectivity=knn_graph)
org_Labels = agg_clustering.fit_predict(org_X)

org_Labels
array([0, 0, 0, 0, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 4, 1, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2], dtype=int64)

With the clustering complete, we can now take these labels, add them back into the organization summary and get a sense for which types of organizations are similar.

org_summary["ORG_CLASS"] = org_Labels

duckdb.sql("""
    SELECT
    ORG_CLASS
    ,STRING_AGG(ORGANIZATION_TYPE, ',') AS ORG_TYPES
    FROM org_summary
    GROUP BY
    ORG_CLASS
""").show()
┌───────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ORG_CLASS │                                                                                                                                                                                                                                                                                                ORG_TYPES                                                                                                                                                                                                                                                                                                 │
│   int64   │                                                                                                                                                                                                                                                                                                 varchar                                                                                                                                                                                                                                                                                                  │
├───────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│         2 │ Services,Industry: type 12,Industry: type 2,Transport: type 1,Agriculture,Industry: type 4,Trade: type 6,University,Industry: type 13,Trade: type 4,Bank,Security,Culture,Police,Trade: type 1,Housing,Industry: type 5,Religion,Military,Industry: type 11,Telecom,Industry: type 9,Electricity,Hotel,Advertising,Industry: type 6,Postal,Transport: type 2,Industry: type 3,Restaurant,Industry: type 1,Mobile,Cleaning,Legal Services,Industry: type 8,Trade: type 5,Trade: type 2,Trade: type 3,Security Ministries,Industry: type 7,Insurance,Transport: type 3,Emergency,Realtor,Industry: type 10 │
│         0 │ Other,Trade: type 7,Medicine,Construction,Business Entity Type 2,Kindergarten,School,Government,Business Entity Type 1,Transport: type 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                 │
│         1 │ Business Entity Type 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   │
│         3 │ Self-employed                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            │
│         4 │ XNA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      │
└───────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Looking at these results, it wasn’t an ideal clusting of these types since we still have most types concentrated in two classes. However, mathematically, these are fairly similar in their relationship with TARGET. We are seeing some benefit as this would be a far better predictor that the original ORGANIZATION_TYPE.

Conclusion

Summary

There’s plenty more to explore with these data sets. As we move on from the exploratory data analysis phase and into modeling, there will continue to be emphasis placed on discovering insights.

At this point, however, we’ve reached some key milestones:

  1. We have a very thorough understanding of the data
    • What fields are available
    • How each data set connects to each other
  2. We’ve explored who is this cohort of under served applicants
    • What does “insufficient” credit history look like
    • How many applications meet this definition
    • How similar these sample of applications are to the population
  3. We also began exploration of predictive power
    • What fields of the application are likely distractions
    • How to reduce the dimensionality of continuous variables
  4. We also honed in on discrete values and the information they held
    • The unique range of values present in categorical variables
    • Approaches to clusting discrete variables for dimensionality reduction

Key Insights

Throughout this exploration of the data, a few key insights have surfaced:

  1. Data from past applications and credit bureau history applies to 6 of every 7 applications
  2. The remainder are those with “insufficient” history, defined as fewer than 3 previous records
  3. Applicantions with insufficient history are, as a whole, very similar to the population at large
  4. DOCUMENT and DWELLING related variables present little to no predictive power for default
  5. A handful of categorical variables seem like promising candidates for high predictive power
    • Namely: NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, and ORGANIZATION_TYPE